CN101727512A - General algorithm and parallel computing system based on the variational multiscale method

Info

Publication number: CN101727512A
Application number: CN200810224328A
Authority: CN
Inventors: Ge Wei (葛蔚); Li Jinghai (李静海)
Original assignee: Institute of Process Engineering of CAS
Current assignee: Institute of Process Engineering of CAS
Priority date: 2008-10-17
Filing date: 2008-10-17
Publication date: 2010-06-09
Anticipated expiration: 2028-10-17
Also published as: CN101727512B

Abstract

The invention discloses a computer software and hardware system based on a variational multiscale method, belonging to the technical field of high-performance computing. Addressing the multiscale structure and discrete nature of complex systems in the real world, the system provides a computation mode in which discrete model units describe the bottom-level behavior of a system and variational constraints describe the stability conditions that the collective behavior of large numbers of model units is observed to follow. This greatly reduces the amount of computation compared with purely bottom-level discrete simulation and markedly improves accuracy compared with single-level averaged simulation. The multi-level computing hardware designed accordingly mirrors this computation mode directly: a large number of simple processors compute the interactions among model units, while complex general-purpose processors compute the variational constraints acting on them, thereby markedly increasing the solution speed and the achievable problem size for this class of problems.

Description

A general algorithm and parallel computing system based on the variational multiscale method

Technical field

The present invention relates to the field of high-performance computer numerical simulation, and in particular to a general algorithm and parallel computing system based on the variational multiscale method.

Background technology

Computer simulation has become, alongside theory and experiment, the third basic method of scientific research and technological development, and it is also the main application of high-performance computing. However, the development of high-performance computing systems is still driven mainly by advances in hardware technology; computer systems are not designed with the common features of computer simulation as their basic starting point. Consequently, as component technology, characterized by integrated-circuit line width, gradually approaches its limits under present conditions, the bottlenecks to further increases in computing power are becoming more and more prominent, chiefly in the following respects:

1) The gap between the peak speed of computers and their actual computing capability is widening. The mainstream route to high-performance computing at present is massively parallel processing (MPP). Because such systems are expensive, making full use of their hardware resources is a key consideration in MPP design. The traditional approach is to let the system accommodate many different algorithms and applications, that is, to rely on generality to keep the workload full so that system resources are fully used. In principle this requires fast global data exchange, both between processors and memory and, directly or indirectly, between processors. Under this design approach, the hardware overhead for communication inevitably grows nonlinearly as the number of processors increases, while the actual speed of the system cannot grow linearly with the number of processors, and this has become the main bottleneck to improving machine performance. System scale also has a limit, because it is severely constrained by the achievable integration density of components. Even for single-processor microcomputers, the widening gap between processor speed and memory access speed means that many programs run at only about 10% of peak performance, and on MPP systems the figure often drops to only 1 to 2%.

2) The accuracy of computational methods is increasingly mismatched with the complexity of the simulated objects. Current computer simulation methods are generally reductionist: they assume that the simulated object can be described by a set of equations or by a set of rules for the interaction between units, and that, given enough computing power to solve the equations or execute the rules, the behavior of the object can be reproduced and predicted. Even setting aside whether this assumption holds, in large-scale computations, such as long-time simulations of large systems, unavoidable step-size and word-length errors make the reliability of the results hard to guarantee. It is difficult to estimate the error of an actual computation accurately from mathematical theory alone. Moreover, many simulation results are hard to verify effectively and comprehensively by experiment; indeed, many systems have to be simulated precisely because they are hard to study experimentally. The more complex the simulated object and phenomena, the more prominent this problem becomes.

3) The gap between the availability of high-performance computing and the demand for it is growing. As the third means of research and development besides theory and experiment, high-performance computing cannot remain confined to a few computing centers; it needs to enter laboratories, companies, and classrooms on a large scale. Moreover, with the continuing development of society, the demand for high-performance computing has gone far beyond scientific and engineering computation: "complex giant systems" such as society, the economy, finance, and ecology are all becoming objects that call for quantitative analysis and simulation, since unscientific decisions may otherwise bring incalculable losses and disasters. Yet the cost, energy consumption, floor space, and maintenance expense of current high-performance computing systems have all reached considerable levels, and efficient programming and use are very difficult. This deters a large number of potential users and has become a major obstacle to the development of high-performance computers and their applications. Building more economical and practical high-performance computing systems has therefore become an urgent problem.

To break through these bottlenecks, we note that multiscale structure and discrete nature are common features of most simulated objects. Through many years of research we have found that the stability condition formed by the mutual coordination of the various control mechanisms in a system is the fundamental factor determining the behavior of its multiscale structure. From case studies we have further found that the following approach is generally applicable to the simulation of complex systems with multiscale structure:

1) at a suitable scale, the system is discretized into a large number of simple model units with superposable short-range interactions;

2) in addition to the interactions among these units, they are subject to one or more variational or extremum constraints, and therefore behave differently from the way they would if acting independently;

3) the form of the applied constraint is in turn determined by the behavior of the constrained units, so a higher-level, more complex model unit can be set up to embody this constraint-feedback mechanism through its interaction with the lower-level units;

4) the relationships between these units can be nested, forming a multi-level computational model.

Based on this approach, we can design a system of computing units with multi-level short-range connections in which, from top to bottom, the units go from complex to simple and from few to many. Suitable mappings are established between computing units and model units, and between the interactions among model units and the connections among computing units, so as to exploit the performance of the computing hardware to the greatest extent and reduce unnecessary hardware overhead. With this approach we can also use the constraint exerted by upper-level units on lower-level units, based on the physical stability condition of the simulated object, to correct computational errors, ensuring the accuracy of the computation at the level of mechanism.

In Chinese patent applications 200510064799.1, 200710099551.8, and 200810057259.4, we addressed mainly the interactions between model units in the above approach, the so-called "particle methods", and proposed different designs for a general algorithm and its dedicated hardware system. On that basis, the present invention proposes a more general algorithm, a more complete variational multiscale method, and an overall design for its dedicated hardware system, focusing on the implementation of variational or extremum constraints and on the differentiated design and coupling of computing units at different levels. To better set out the motivation, significance, technical solution, and application prospects of the invention, we first briefly introduce the variational multiscale method and discrete simulation methods, then analyze the state of the art and development trends of existing special-purpose software and hardware systems, mainly particle simulation systems, then explain the design ideas and remaining problems of our earlier patent applications, and finally introduce the present invention.

One) Brief introduction to the variational multiscale method

The variational multiscale method referred to here is mainly a method we have developed for simulating the evolution of complex systems, in which a stability condition closes the system of dynamic equations. Multiscale methods in general fall into two types, descriptive and correlative. The former couples descriptions at different scales in space and/or time, for example simulating the behavior near a micro-crack in a material by molecular dynamics while describing the bulk behavior by the finite element method. The latter takes statistical correlations obtained from small-scale simulation as model equations for large-scale simulation, for example using the equation of state and constitutive relations of a fluid obtained from molecular dynamics simulation as components of the hydrodynamic equations. Neither type, however, explicitly considers that under non-equilibrium conditions the large-scale behavior, being the collective behavior of small-scale units, can exhibit emergent properties and in turn constrain the small-scale behavior. Moreover, unless a small-scale model is derived from first principles, it will explicitly or implicitly lack closure, as in turbulence models.

The variational multiscale method, by contrast, holds that in a nonlinear non-equilibrium system two or more control mechanisms must be at work. Although these mechanisms mostly differ from system to system, each can be expressed as some extremum or variational condition. By analyzing the extremum or variational condition of each control mechanism acting alone, together with the rule by which these conditions are coordinated with one another, we can formulate the stability condition that the system must satisfy. This stability condition supplies exactly the closure condition missing from the dynamic description. For multiphase flow, we have so far found stability conditions for dynamic multiscale structures in several systems, including gas-solid, gas-liquid, and liquid-liquid systems. With these conditions, the dynamic equations obtained at each scale by scale decomposition can be closed and cross-scale correlation achieved. Then, by mathematical multi-objective optimization, the multiscale-coupled sub-element model can be solved, giving insight into the mechanisms by which multiscale structures arise in complex systems, the laws of their spatio-temporal evolution and of their abrupt changes at critical conditions, and hence the essence of how complex systems arise and evolve.
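As a toy illustration of how an extremum condition can close an underdetermined description, consider two unknowns `a` and `b` tied by a single balance equation `a + b = c`. The sketch below assumes, purely for the example, that one mechanism alone would minimize `a*a`, that the other alone would minimize `b*b`, and that their coordination can be written as a weighted sum with weight `w`. None of these functions, nor the function name `close_by_extremum`, comes from the patent; they only show the role a stability condition plays.

```python
def close_by_extremum(c: float, w: float, n: int = 3000):
    """Pick (a, b) on the balance line a + b = c that minimizes a weighted
    compromise between two hypothetical extremum tendencies."""
    best = None
    for k in range(n + 1):
        a = c * k / n          # candidate value allowed by the balance equation
        b = c - a              # the balance equation fixes b once a is chosen
        cost = w * a * a + (1.0 - w) * b * b   # assumed compromise objective
        if best is None or cost < best[0]:
            best = (cost, a, b)
    return best[1], best[2]


a, b = close_by_extremum(c=3.0, w=1.0 / 3.0)
```

For this particular objective the minimum can also be found analytically at `a = (1 - w) * c`, which gives a quick check on the scan. A weighted sum is only one simple way to express a compromise between objectives; the text does not specify the form used in practice.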

Two) Brief introduction to discrete simulation methods

Discrete simulation methods provide a fundamental, universal, and simple way of describing the dynamic behavior of complex systems. Most of them discretize the simulated system into a large number of interacting particles and describe the behavior of each particle by dynamic calculation, reproducing the behavior of the system directly or through statistics and aggregation; they are therefore also called many-particle methods. Typical examples include:

Molecular dynamics (MD). Atoms, atomic groups, or molecules are reduced to particles that interact through potentials, rigid constraints, and similar means, in order to describe the microscopic behavior of molecules, molecular clusters, and ultimately materials. MD is now widely used in fields such as the synthesis of chemicals, the study of biological macromolecules, the research, design, and preparation of new materials, and the exploration of the nature of life. In a broad sense, the simulation of nuclear radiation, such as neutron diffusion, also falls under molecular dynamics methods.

Discrete element method (distinct element method, DEM). For granular solids such as sand and gravel, grain, and powders, the interaction forces between the naturally discrete particles (such as the pressure and friction produced on contact, and the electrostatic forces that may act without contact) are calculated, and from them the trajectory of each particle; this is called the discrete element method. It is now widely applied in industrial processes, agricultural engineering, geology, and hydrology.

N-body dynamics. On cosmic scales, from celestial bodies and galaxies to galaxy clusters and even the whole universe, the discrete nature of the world is also evident, and each smaller structure can be regarded as a particle making up the larger one. N-body dynamics tracks the trajectories and collective behavior of these huge "particles" by calculating the universal gravitation between them, and is a mainstream means of simulation in astrodynamics. The method provides a powerful tool for exploring the formation and evolution of the universe and for the future space industry.

Coarse-grained models. Particle methods are not limited to systems that can intuitively be treated as collections of particles. In recent years many particle methods based on coarse-grained or simplified model particles have also been proposed for behavior traditionally simulated by continuum methods, such as fluid flow and material deformation. Examples include the mesoscopic dissipative particle dynamics (DPD) method and lattice Boltzmann (LB) method, and the macroscopic smoothed particle hydrodynamics (SPH) method. Physically, these model particles are roughly a Lagrangian representation of a cluster of molecules or of a material element. These models overcome the problem that computational cost must grow with the number of particles naturally contained in the system (an important reason for adopting continuum methods), and they are particularly suited to problems that challenge continuum methods, such as complex boundaries, multiphase media, and large deformations. They are now widely applied in fields including the design of ships, aircraft, and vehicles, the research and design of nuclear weapons and reactors, energy, chemical engineering, water conservancy, geological exploration and development, and meteorological and marine forecasting.

In fact, from a physical point of view, the explicit numerical schemes of many continuum models can also be regarded as discrete simulation models, and the agent models commonly used to simulate social, economic, and similar systems can be regarded as relatively complex discrete simulation models. The coverage of discrete simulation is therefore quite broad.

Two) Special-purpose software and hardware systems for different discrete models

No general-purpose software or dedicated hardware for the variational multiscale method has appeared so far, but a number of special-purpose software and hardware systems have been proposed internationally for different particle methods. Early examples include US 4740894 (published 1988-04-26), which proposed a processing unit with multiple input/output ports, and US 3970993 (published 1976-07-20), which connects processing units in series with unidirectional chaining channels so that data can be passed on to the next processing unit. Such units can be used to build processor arrays suited to some simple particle algorithms. A more typical example is the method and system of US 5432718 (published 1995-07-11), which uses combinational logic and a two-lattice computation rule for particle motion on a lattice; it implements the corresponding lattice gas automaton (LGA) very efficiently and, with suitable modification, is also applicable to some other lattice-based particle methods such as the lattice Boltzmann method (LBM). The limitations of processor arrays are nevertheless evident. In principle each processing unit can handle only the few operations predefined in its hardware and lacks the ability to store and interpret instructions so as to run programs independently, so its generality is very poor.

In recent years, IBM and Japan's Institute of Physical and Chemical Research (RIKEN) have cooperated in developing the MD-GRAPE (Molecular Dynamics GRAvity PipE) family of chips, boards, and special-purpose computers dedicated to N-body problems and molecular dynamics simulation. They implement the typical inter-particle interaction algorithms of these problems as dedicated hardware pipelines: each chip contains many pipelines that operate in parallel, and each pipeline can process several inter-particle interactions in parallel. Later pipelines also use field-programmable gate array (FPGA) devices, so that the hardware pipeline can be reconfigured for different inter-particle interaction algorithms, improving efficiency and generality.

The predecessor of the MD-GRAPE chip is the GRAPE chip, which won the IEEE Computer Society Gordon Bell Prize in both 1995 and 1996. The MD-GRAPE chip is optimized specifically for calculating inter-particle interaction forces in N-body problems and molecular dynamics simulation. It provides 4 parallel computation pipelines, each of which can calculate the interaction forces of 6 particles simultaneously, and the chip can store information for 1,000,000 particles at a time. Together with other special-purpose chips later designed by RIKEN, MD-GRAPE was used to build a 100 Tflops computer system dedicated to N-body problems and molecular dynamics simulation.

To make the dedicated N-body and molecular dynamics computing power of MD-GRAPE readily available to users, an MD-GRAPE board that plugs directly into a computer expansion slot was also designed. The board uses a PCI interface and can be installed in machines ranging from a user's personal computer to an IBM RS/6000 SP high-performance computer. Each board integrates 4 MD-GRAPE chips, provides 64 Gflops of computing power on a PC or RISC workstation, and offers users FORTRAN and C library interfaces.

The second-generation MD-GRAPE chip integrates 9,000,000 transistors and uses IBM's 0.25 μm, 2.5 V process; it consumes 15 W at a 100 MHz clock frequency, likewise has 4 computation pipelines, and provides 64 Gflops of computing power. The third-generation MD-GRAPE chip has 20 computation pipelines and provides 165 or 200 Gflops depending on its operating frequency. The MD3-PCIX board integrates two MD-GRAPE3 chips for 330 Gflops of computing power and can be plugged directly into a user's computer.

Besides chips and boards, there is also an MD-GRAPE-3 dedicated processor unit integrating 12 MD-GRAPE3 chips. Its enclosure is a 2U, 19-inch unit, and it connects through a dedicated 10 Gbit/s communication link to an interface card in a PCI-X slot of the host computer. Each MD-GRAPE-3 chip provides 200 Gflops, the peak performance of the whole computer reaches 1 Pflops, and its power consumption is 300 kW, achieving high performance at low power.

QCDOC is a special-purpose chip developed by IBM specifically for quantum chromodynamics (QCD) calculations, and several laboratories have built QCDOC special-purpose computers from it.

The QCDOC chip is an application-specific integrated circuit (ASIC) developed around a PowerPC core. It contains a 500 MHz PowerPC 440 processor providing 1 Gflops of 64-bit floating-point performance and 4 Mbytes of integrated embedded DRAM (EDRAM), which holds code and data during standard lattice QCD calculations; the data path from the processor core to the EDRAM has a peak bandwidth of 8 GByte/s. The ASIC also provides DMA, which transfers data automatically between the EDRAM and external memory and supports inter-node communication, and it includes an Ethernet controller for booting, diagnostics, and the I/O network.

The QCDOC special-purpose computer is a parallel computer whose compute nodes are QCDOC chips. The nodes are interconnected as a 6-dimensional mesh torus, each node being linked to its 12 nearest neighbors at 500 Mbit/s full duplex. The links use phase-locked receivers, detect single-bit errors with automatic retransmission, and allow direct DMA access to the memory of neighboring nodes. The computer also has a 100 Mbit/s Fast Ethernet network for booting, diagnostics, and general management I/O, and it runs QOS, an operating system designed specifically for QCD calculations.
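The figure of 12 neighbors follows from the topology: on a 6-dimensional torus each node has one neighbor in each direction along each of the 6 axes. A minimal sketch of the neighbor computation is given below; the function name `torus_neighbours` and the lattice extent of 4 nodes per axis are assumptions chosen for the example, not the dimensions of any actual machine.

```python
from typing import List, Tuple


def torus_neighbours(coord: Tuple[int, ...], dims: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    """Return the nearest neighbors of a node on a torus with periodic wrap-around."""
    result = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size  # wrap around the torus
            result.append(tuple(n))
    return result


dims = (4, 4, 4, 4, 4, 4)  # hypothetical extent along each of the 6 axes
nbrs = torus_neighbours((0, 0, 0, 0, 0, 0), dims)
```

With at least 3 nodes per axis the 12 neighbors are all distinct, so the number of links per node stays constant however large the machine grows.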

Three QCDOC special-purpose computers are currently installed, at Columbia University (1024 nodes), the RIKEN BNL Research Center (12288 nodes), and the DOE Brookhaven National Laboratory; the latter two each reach a speed of 10 Tflops.

In 2007 Tilera released Tile64, a 64-core processor built on a 64 nm process and operating at 600 to 900 MHz. In existing multi-core processors the cores communicate mainly over a bus; with 16 or more cores, the data rate of the bus becomes a bottleneck and processor performance suffers greatly. Tile64 has no central bus. Instead, its cores are connected directly to one another, which effectively avoids the speed bottleneck of existing processor architectures and allows operation at lower power. In addition, each Tile64 core is a fully functional processor that can run an operating system on its own. The figure below is a schematic of the internal structure of the Tile64 chip.

Tilera's Multicore Development Environment (MDE) is the development environment for the Tile64 chip. Tilera currently offers two board models, TILExpress-64 and TILExpress-20G, which are used mainly for applications such as multimedia stream processing and network traffic inspection.

Three) Earlier research underlying the present invention

The systems above take the characteristics of particle methods into account in certain respects, adopting short-range connections (Tile64 and QCDOC) and dedicated stream-computing techniques (MD-GRAPE). None, however, proposes a design for a general-purpose software and hardware system for discrete simulation, and none considers the multiscale discrete nature common to real complex systems or the variational constraints between simulation units at different levels so as to couple a multiscale algorithm with a multi-level architecture.

In Chinese patent applications 200510064799.1, 200710099551.8, and 200810057259.4, we summarized the following common features of discrete simulation and proposed corresponding computing system designs. The aims were, at the level of the computational model and software, to significantly reduce the computational cost of direct discrete simulation and widen the range of application of the system, and to combine the currently mainstream multithreaded shared-memory and stream-computing parallel techniques with highly scalable short-range interconnection, exploiting the advantages of each while offsetting their weaknesses.

First, whether the model units considered are naturally occurring particles, artificially constructed model particles, or even rather complex agents, the strength of the interaction between them generally decreases rapidly as the distance (or some logical distance) grows. Interactions between physical particles arise in essence from no more than four fundamental forces (in fact three or fewer). Of these, the gravitational and electromagnetic forces are inversely proportional to the square of the distance between particles, and the strong and weak interactions decay even faster. The interaction between sufficiently distant particles can therefore generally be neglected, or the force between every pair of particles can be replaced by an estimate of the resultant force of a large group of particles. This gives rise to locality: although the whole system may contain arbitrarily many model units, the instantaneous motion of any one model unit is determined directly and mainly by a very small number of neighboring model units.
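One common way to exploit this locality in software is a cell list, which bins units into cells as wide as the interaction cutoff so that each unit is compared only with units in its own and adjacent cells. The sketch below is an illustrative 2D example; the cell-list technique is a standard one chosen here for illustration rather than a method named in the text, and the function name `neighbour_pairs`, the cutoff value, and the sample points are all hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def neighbour_pairs(points, cutoff):
    """Return index pairs (i, j), i < j, of 2D points no farther apart than cutoff."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(math.floor(x / cutoff), math.floor(y / cutoff))].append(idx)
    pairs = set()
    for (cx, cy), members in cells.items():
        # Only the 3 x 3 block of cells around (cx, cy) can hold partners.
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in cells.get((cx + dx, cy + dy), ()):
                for i in members:
                    if i < j and math.dist(points[i], points[j]) <= cutoff:
                        pairs.add((i, j))
    return pairs


points = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.2, 3.1)]
pairs = neighbour_pairs(points, cutoff=1.0)
```

Because each unit inspects a bounded number of cells, the work per unit does not grow with the total number of units as long as the density stays bounded.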

At the same time, the interaction function between a pair of model units can generally be described by one or a set of algebraic or ordinary differential equations, and the effects of the individual pairs acting on a model unit at the same time are superposable. In other words, the interaction between each pair of units can be processed independently and in any order, and the total effect on a unit is obtained by simple summation. The concrete treatment is not quite so simple for hard-sphere particles, for composite particles formed from several particles by constraints (such as chain macromolecules), or for some agents in social and economic systems; but on a somewhat larger scale, for example for a composite particle taken as a whole, the algorithm still generally has this property.
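The superposition property can be illustrated with a short sketch in which the pairs are processed in two different orders and the summed effect on each unit is compared. The linear spring law in `pair_force` is a hypothetical stand-in for whatever pairwise rule a particular model uses, and the positions and pair list are made up for the example.

```python
import random


def pair_force(xi: float, xj: float) -> float:
    """Hypothetical 1D pair law: force on unit i due to unit j."""
    return xj - xi


def net_forces(xs, pairs):
    """Accumulate pairwise contributions; the pair order is arbitrary."""
    forces = [0.0] * len(xs)
    for i, j in pairs:
        fij = pair_force(xs[i], xs[j])
        forces[i] += fij   # effect of j on i
        forces[j] -= fij   # equal and opposite effect of i on j
    return forces


xs = [0.0, 1.0, 3.0, 6.0]
pairs = [(0, 1), (1, 2), (2, 3), (0, 2)]
shuffled = pairs[:]
random.Random(0).shuffle(shuffled)

ordered_result = net_forces(xs, pairs)
shuffled_result = net_forces(xs, shuffled)
```

The two results agree up to floating-point rounding, since floating-point addition is not exactly associative; this freedom of ordering is what allows the pairs to be distributed over many simple processors.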

It follows that discrete simulation methods are widely applicable and that software and hardware design for this class of methods can be optimized significantly. The interaction modes between different model units can be embedded as modules in a general overall algorithm and data structure. Through spatial partitioning, discrete simulation methods can achieve almost linear speed-up, and each computing unit of the hardware system need only share memory or exchange messages with a specific, very small number of neighboring computing units, which makes large-scale expansion quite easy. The complexity and size of each computing unit can also be greatly reduced (for example, with only cache and no main memory), increasing the proportion of components engaged in computation at any time, that is, improving their utilization and reducing cost. Compared with general-purpose high-performance computers, hardware systems designed within this framework have a somewhat narrower range of application, but demand for them remains large, and the benefit of lower hardware cost and higher efficiency will far outweigh the effect of the narrower range. Such systems therefore have very broad prospects.
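A minimal 1D sketch of spatial partitioning is given below. Each subdomain owns the units inside it and receives copies ("halo" entries) only from its two adjacent subdomains, which is enough to find every interacting pair as long as the cutoff does not exceed the subdomain width. The function names, the domain length, and the sample positions are assumptions made for the example.

```python
def partition(xs, n_parts, length, cutoff):
    """Assign 1D positions to equal-width subdomains and build halo copies."""
    width = length / n_parts
    if cutoff > width:
        raise ValueError("cutoff must not exceed the subdomain width")
    owned = [[] for _ in range(n_parts)]
    halo = [[] for _ in range(n_parts)]
    for x in xs:
        p = min(int(x / width), n_parts - 1)
        owned[p].append(x)
        # Units near a boundary are copied to the adjacent subdomain only.
        if p > 0 and x - p * width <= cutoff:
            halo[p - 1].append(x)
        if p < n_parts - 1 and (p + 1) * width - x <= cutoff:
            halo[p + 1].append(x)
    return owned, halo


def local_pair_count(own, ghosts, cutoff):
    """Count interacting pairs using only locally available data."""
    count = 0
    for i, a in enumerate(own):
        for b in own[i + 1:]:
            count += abs(a - b) <= cutoff
        for g in ghosts:
            # Cross-boundary pairs are counted once, by the owner of the lower position.
            count += a < g and abs(a - g) <= cutoff
    return count


xs = [0.5, 1.9, 2.1, 3.5, 3.9, 4.2, 7.7]
owned, halo = partition(xs, n_parts=4, length=8.0, cutoff=0.5)
total = sum(local_pair_count(o, h, 0.5) for o, h in zip(owned, halo))
```

Each call to `local_pair_count` touches only one subdomain and its halo, so the calls could run on separate computing units that communicate with their immediate neighbors alone.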

In line with these features, Chinese invention patent application 200510064799.1 proposed a scalable parallel architecture comprising multiple computing units and multiple storage units, each forming an array, in which every storage unit is connected to several adjacent computing units and every computing unit is connected to several adjacent storage units; this local shared-memory mode reflects the short-range and superposable nature common to particle interactions. Chinese invention patent application 200710099551.8 provides a multi-layer directly connected cluster parallel computing system for particle simulation. The system consists of computing units arranged in one or more layers of one- or multi-dimensional arrays; adjacent computing units in the same layer communicate over direct connections, and computing units in different layers communicate through switches, so that the multiscale nature of particle interactions is given preliminary consideration alongside short-range interaction. Chinese invention patent application 200810057259.4 proposed a concrete design for the computing units of this cluster parallel system, in particular concrete software and hardware solutions for the coupled use, within a computing unit, of stream-processing chips with large-scale shared-memory parallelism (such as graphics processing units, GPUs) and general-purpose chips with small-scale multithreaded parallelism (such as central processing units, CPUs).

Although the above patent applications consider multiscale effects (multi-layer direct connection and the coupling of heterogeneous processing chips) and the generalization of the short-range interaction model (local shared memory and short-range direct connection), they do not systematically consider special-purpose computing systems for multiscale discrete simulation in a more general sense. Although they use connections between upper- and lower-layer units to handle long-range correlations between model units, they do not propose a systematic way of using the variational multiscale method to constrain the interactions between model units so as to simplify computation and improve efficiency, nor do they address deeper issues such as the reliability and flexibility of large-scale parallelism or the optimized matching of computation, storage, and communication speeds. The present invention proposes concrete solutions to these problems.

Summary of the invention

One) Technical problem to be solved

In view of this, the main purpose of the present invention is to provide an algorithm and an overall computer system design, based on the variational multi-scale method, for multi-scale complex systems, so as to distill and optimize an algorithm that is generally applicable to the simulation of a large number of complex systems, simplify the design of high-performance computer systems, and improve their efficiency.

Two) Technical solution

To achieve the above purpose, the invention provides a general algorithm for simulating complex systems and a design for the corresponding computer software and hardware system.

The algorithm discretizes the simulated system into interrelated model units on different levels. Relatively simple short-range interactions take place between lower-level model units, while higher-level model units can constrain the motion of lower-level model units.

Each model unit can be described by a specific set of variables, called state variables. A specific set of values of the state variables of a model unit is called a state of that model unit. The state of a model unit changes through interactions between model units.

The constraint exerted by a higher-level model unit on lower-level model units takes the form of an extremum or variational constraint: in general, the associated lower-level model units must collectively satisfy an extremum or variational condition on one or more variables or functions determined by the states of these model units, and when several extremum or variational conditions exist they also act as constraints on one another.

Interactions between lower-level model units are parallel, that is, the interaction between any two lower-level model units can be processed at the same time as the others: from the state of the processed model unit and part or all of the state information of another model unit interacting with it, the contribution of that other model unit to the state change of the processed model unit can be computed.

Interactions between model units in the same lower layer are superposable, that is, the total contribution of the same-layer model units to the state change of a processed model unit is a function of their individual contributions to that state change.

Interactions between model units in the same lower layer are short-range, that is, each lower-level model unit interacts only with a number of same-layer model units not exceeding a specific upper limit. These are called the neighbour model units of that model unit; the model units adjacent to these neighbours are called its double neighbour model units, and triple, quadruple and higher-order neighbour model units are defined by analogy. The upper limit on the number of neighbour model units of each model unit does not grow with the number of model units in the layer. In addition, the neighbour model units of a given model unit are themselves neighbours of one another of some bounded order, and the upper limit of this order does not grow with the number of model units in the layer.
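The neighbour orders defined above can be illustrated with a short sketch. It assumes, as one possible reading of the text, that the order of a neighbour is the smallest number of adjacency steps needed to reach it; the ring topology and the names `ring` and `neighbour_orders` are hypothetical and chosen only for illustration.

```python
from collections import deque

def ring(n):
    """Hypothetical layer of n model units on a closed 1-D array:
    each unit interacts only with its two adjacent units."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def neighbour_orders(adjacency, start):
    """Return {unit: order}, where order 1 is a neighbour, 2 a double
    neighbour, and so on (assumed here to be the shortest-path distance)."""
    order = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in order:
                order[v] = order[u] + 1
                queue.append(v)
    del order[start]  # a unit is not its own neighbour
    return order

orders = neighbour_orders(ring(8), 0)
```

In this example every unit has exactly two neighbours whatever the size of the ring, which is the property the text requires: the bound on the number of neighbours does not grow with the number of units in the layer.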

Interaction between lower-level and higher-level model units is a feedback-constraint mechanism: each lower-level model unit directly connected to a higher-level model unit passes part or all of its state information to that higher-level model unit. The total contribution of the lower-level model units to the state change of the higher-level model unit is a function of this information. The contribution of the higher-level model unit to the state change of each lower-level model unit is a function of the higher-level model unit's state and of the change in that state.

State changes of model units are progressive: the new state of each model unit is determined by the total contribution of the same-layer model units to its state change together with the contributions of its lower-level and higher-level model units, and this new state in turn determines the next updated state of the model unit in the same way.
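The properties listed above (short-range, superposable same-layer interactions, feedback to and constraint from a higher-level unit, and progressive state updates) can be put together in a minimal two-level sketch. Everything specific in it is an assumption made for illustration: the scalar state, the diffusion-like pairwise contribution, the closed 1-D chain, and the use of a prescribed group mean (`target_mean`) as a simple stand-in for the extremum or variational constraint.

```python
from typing import List

def step(states: List[float], target_mean: float,
         coupling: float = 0.1, relax: float = 0.5) -> List[float]:
    """Advance a closed 1-D chain of lower-level model units by one step."""
    n = len(states)
    # Short-range and superposable: each unit sums the pairwise
    # contributions of its two nearest same-layer neighbours only.
    same_layer = [
        coupling * ((states[(i - 1) % n] - states[i]) +
                    (states[(i + 1) % n] - states[i]))
        for i in range(n)
    ]
    # Feedback: the higher-level unit's state is a function (here the mean)
    # of the state information reported by the lower-level units.
    upper_state = sum(states) / n
    # Constraint: the higher-level unit sends a correction back to every
    # lower-level unit, pulling the group mean towards target_mean.
    correction = relax * (target_mean - upper_state)
    # Progressive update: the new state combines both contributions and
    # becomes the starting point of the next step.
    return [s + d + correction for s, d in zip(states, same_layer)]

states = [0.0, 1.0, 2.0, 3.0]
for _ in range(200):
    states = step(states, target_mean=2.0)
```

The same-layer loop is independent for each unit, so in the hardware described later it would be the part assigned to many simple processors, while the two lines computing `upper_state` and `correction` play the role of the higher-level unit.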

Corresponding to the above algorithm, the computer hardware system comprises computing units on different levels, each of which can exchange or share information with neighbouring computing units in the same layer within a certain range, or with certain corresponding computing units in an adjacent layer. Lower-level computing units are more numerous and have simpler logic circuits, and perform relatively simple computations. Higher-level computing units are fewer and have more complex logic circuits, and can perform complex computations.

The above computer hardware system may be composed in the following ways:

A) The system consists of a plurality of computing unit groups, each containing one communication switching mechanism and a plurality of computing units directly connected to it. The communication switching mechanisms are logically arranged in one or more layers of one- or multi-dimensional arrays, with direct communication connections between adjacent switching mechanisms in the same layer. Several lower-layer switching mechanisms are also directly connected to a specific upper-layer switching mechanism.

B) The system consists of one or more layers of one- or multi-dimensional computing unit arrays. Each computing unit is directly connected to two or more switching mechanisms, so as to guarantee that any two adjacent same-layer computing units in the array have at least one indirect connection passing through only one switching mechanism, and that each lower-layer computing unit has at least one indirect connection, passing through only one switching mechanism, to a specific computing unit in the layer above.

A computing unit is a logical structural unit with independent computing and communication functions, and may be any one or any combination of a chip, a chipset, a field-programmable gate array (FPGA) chip, a board and a stand-alone computer; the internal substructure of these chips, chipsets, FPGA chips, boards and stand-alone computers may itself also adopt the organization of the described parallel computing system.

A higher-level computing unit contains a small number of general-purpose processors capable of handling a variety of complex algorithms. A lower-level computing unit contains a large number of special-purpose processors suited only to a certain class of problems, including single-instruction multiple-data (SIMD) stream processors; a special-purpose processor may have no memory or share a small amount of memory, but can deliver its results to a general-purpose processor. Some middle-layer computing units may contain both kinds of processor.

The array may be extended arbitrarily and may be formed by any repeatable arrangement, including at least arrays built from rectangles or cuboids, triangles or tetrahedra, and hexagons or tetrakaidecahedra. The edges of the array are either open or joined to the corresponding opposite edges.

The communication connection may be any connection suitable for use between the computing units, including at least a communication bus, a crossbar switch, network cards and network cabling, serial or parallel ports and their cables, and USB ports and their cables.

The information sharing mode may be any mode suitable for use between the computing units, including at least shared main memory, shared video memory, shared cache or shared registers.

The communication switching mechanism is one that supports any of the above communication connections with multiple inputs and a single output, or with multiple inputs and multiple outputs.

When the general algorithm is executed on this system, the topological relationship between model units in the algorithm corresponds to the topological relationship between computing units in the parallel system. That is, the state variables of one or more lower-level model units are stored in one lower-level computing unit, and the state variables of model units that are neighbours are stored either in the same computing unit or in computing units that are neighbours; the state variables of the higher-level model unit corresponding to these model units are stored in the higher-level computing unit corresponding to that lower-level computing unit. The main computation of a model unit's state change is carried out in the computing unit that stores its state variables, and results from a lower-layer computing unit can be sent directly to the higher-level computing unit and used immediately.
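One simple way to obtain such a topology-preserving placement is a block decomposition, sketched below for a two-dimensional array of model units. The block size, the grouping of computing units under an upper-layer unit, and all function names are assumptions for illustration; the text does not prescribe a particular mapping.

```python
def owner(i: int, j: int, block: int) -> tuple:
    """Lower-level computing unit (block index) that stores model unit (i, j)."""
    return (i // block, j // block)

def upper_owner(unit: tuple, group: int) -> tuple:
    """Higher-level computing unit corresponding to a lower-level computing unit."""
    return (unit[0] // group, unit[1] // group)

def are_neighbours(a: tuple, b: tuple) -> bool:
    """Two array positions are neighbours if they differ by at most 1 per axis."""
    return a != b and all(abs(x - y) <= 1 for x, y in zip(a, b))

def mapping_preserves_neighbourhood(n: int, block: int) -> bool:
    """Check that adjacent model units on an n x n array are stored in the
    same computing unit or in neighbouring computing units."""
    for i in range(n):
        for j in range(n):
            for di, dj in ((1, 0), (0, 1)):
                ni, nj = i + di, j + dj
                if ni < n and nj < n:
                    a, b = owner(i, j, block), owner(ni, nj, block)
                    if a != b and not are_neighbours(a, b):
                        return False
    return True

ok = mapping_preserves_neighbourhood(n=8, block=4)
```

With this placement only model units on block boundaries need data from another computing unit, and that unit is always an adjacent one, which is what allows purely local connections between same-layer computing units.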

Three) Beneficial effects

1. As can be seen from the above technical solution, the variational multi-scale simulation computing system provided by the invention captures the essential feature shared by a large number of different complex systems, from systems as large as celestial bodies and galaxies to systems as small as molecules and atoms, namely their multi-scale discrete structure, and therefore has strong generality.

2. By providing an algorithm and computing hardware matched to discrete-structure simulation, the invention simplifies the design of high-performance computing systems, improves their scalability, fundamentally removes the upper limit on computing performance set by the performance level of individual components, reduces manufacturing and operating costs and energy consumption, and at the same time improves the actual efficiency of the simulation system.

3. At the same time, the multi-level variational constraint and feedback algorithm provided by the system, and the associated hardware structure, also match the multi-scale character and intrinsic mechanisms of the simulated object, and can effectively simplify the computational model, greatly reduce the amount of computation in a simulation, and significantly improve computational accuracy.

4. The second and third aspects above also give the parallel computing system of the invention a distinctly specialized character overall: it provides a computing system well suited to executing one class of algorithm with distinctive features, the variational multi-scale simulation algorithm. Combined with the generality described in the first aspect, this offers a good route to resolving the major conflict between generality and computational efficiency in high-performance computing.

Description of the drawings

Fig. 1 is a schematic diagram of the logical structure of the variational multi-scale algorithm provided by the invention, in which 1 is a model unit, 2 is a short-range interaction, 3 is an upper-layer model unit, 4 is an interaction with an upper-layer model unit, 5 is an interaction between upper-layer model units, and 6 is an interaction with a model unit of a still higher layer;

Fig. 2 is a schematic diagram of solving an optimization problem with the algorithm and parallel computing system provided by the invention, in which 21 is the free parameter space, 22 is the stability criterion, 23 is the extreme point of the stability criterion, 24 is the system of equations, 25 is a lower-layer computing unit, 26 is an upper-layer computing unit, 27 is a batch of coordinate points and their stability criterion values, 28 is a new batch of coordinate points and their stability criterion values, 29 is the root of a nonlinear equation, 30 is a still-lower-layer computing unit, 31 is the function values at different arguments, 32 is the interval containing the root, and 33 is the arguments to be used in the next computation step;

Fig. 3 is a schematic diagram of descriptive-type multi-scale computation using the algorithm and parallel computing system provided by the invention, in which 36 is the whole material, 37 is the region computed by the finite element method, 38 is a local defect or crack, 39 is the region simulated by molecular dynamics, 40 is the overlap region, and 41 is parameter transfer between upper- and lower-layer computing units;

Fig. 4 is a schematic diagram of solving a non-equilibrium evolution problem with the invention, in which 45 is the initial state of the computation, 46 is the true solution satisfying the stability condition, 47 is an approximate solution, 48 is the range of trial solutions, and 49 is a better approximate solution;

Fig. 5 is an overall structural diagram of the parallel computing system provided by the invention, in which 51 is the management, control and input/output unit, 52 is the top-layer switching mechanism, and 53 is an inter-layer connection;

Fig. 6 is a schematic diagram of a computing node formed by an Nvidia computing acceleration unit and a host, in which 60 is the computing acceleration unit, 61 is a professional computing card, 62 is a PCI-E x16 slot, 63 is a switch daughter card, 64 is an interface card in a PCI-E x16 Gen 2 slot, and 65 is the host;

Fig. 7 is a schematic diagram of an input/output and post-processing node, in which 71 is a GTX280 graphics card, 72 is a rack-mounted server, 73 is a storage adapter, 74 is a storage disk array, and 75 is a large-screen display;

Fig. 8 shows the first way of connecting computing units provided by the invention, in which 81 is a switching mechanism, 82 is an interconnection between switching mechanisms, and 83 is an upper-layer switching mechanism;

Fig. 9 shows the second way of connecting computing units provided by the invention, in which 91 and 92 are two mutually offset, overlapping regions of the computing unit array, 93 and 94 are the switching mechanisms connecting the computing units in regions 91 and 92 respectively, and 95 is an interconnection between switching mechanisms.

Embodiment

To make the purpose, technical solution and advantages of the invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.

One) Implementation of the algorithm

In the variational multi-scale simulation algorithm described above, short-range interactions (2) take place between a large number of discrete model units (1), which at the same time interact (4) with certain upper-layer model units (3); short-range interactions (5) also take place between the upper-layer model units, which in turn interact (6) with model units of a still higher layer, and so on, forming a multi-scale structured system, as illustrated in Fig. 1. This can be said to be a feature common to almost all complex systems that require high-performance computing. A social system, for example, has a most basic constituent unit, the individual person; most people have closer relationships with those around them and fewer contacts over long distances. But society is not merely interaction between individuals: each person also takes part in various groups and organizations, such as a family, a workplace, a community or a club, which constrain the behaviour of the individuals who form or join them, and which also interact with one another and can form still higher-level groups as wholes. The material world shows similar features from elementary particles up to the entire universe. With this framework, the modes of interaction between the various units that represent different applications can be embedded as modules in a general overall algorithm and data structure, without writing separate simulation software for each.

Some specific application settings for this simulation framework are listed below:

1) In a multi-scale discrete system, an upper-layer unit in principle constrains and influences the lower-layer units that compose it, so that they behave differently from fully independent individuals; in human society, for example, an individual within an organization is subject to its leadership and management and to the constraints of its rules and regulations. Any simulation that reflects this reality therefore naturally needs a feedback-constraint mechanism with the upper-layer units. Neighbouring interactions also take place between upper-layer units, such as cooperation and competition between organizations or between countries in human society. Moreover, this relationship can be nested, that is, there can be multiple levels of upper- and lower-layer units. In this case the variational multi-scale simulation algorithm above is the most reasonable and natural description. In general, the higher the layer, the fewer the units but the more complex their interactions, so they can rely more on a small number of general-purpose processing units for simulation, while the large number of relatively simple interactions between lower-layer units are simulated by a large number of special-purpose processors.

According to the variational multi-scale principle, in many physical systems this constraint mechanism takes the form of a stability condition that the lower-layer model units must collectively satisfy, which is expressed mathematically as a multi-objective variational problem. In computer simulation, this multi-objective variational problem can ultimately be converted into an optimization problem over a finite number of degrees of freedom. The more complex upper-layer computing units can then be specially optimized for optimization algorithms, and their results can be fed back to the relevant lower-level computing units in the form of some correction.

For example, in macro-particle simulation methods such as SPH, macro-scale pseudo-particle modeling (MaPPM) and the improved moving particle semi-implicit method (MPS), each particle physically represents a fluid parcel, and its state is obtained by averaging, with a certain weight function, the evolved states of the other particles within a certain surrounding range. There are infinitely many weight functions that satisfy the mathematical convergence requirements, and here the stability condition of the flow provides a functional variational condition for selecting among them. For instance, a flow should locally satisfy the variational condition of minimum viscous dissipation, so the local viscous dissipation statistics obtained with different weight functions can be compared in order to judge, select and optimize the weight functions. Under simplified conditions, a candidate functional form with certain parameters is chosen in advance from experience, and the parameter values corresponding to the minimum local viscous dissipation are then found iteratively. Although this introduces iteration into each computation step, optimizing the weight function may improve accuracy significantly, an improvement that previously required smaller and more numerous particles and a smaller time step, so the overall amount of computation may still be reduced markedly.

2) Optimization problems themselves can also be solved rapidly with the algorithm and computer hardware of the invention. In multi-scale models the number of equations is often smaller than the number of variables involved, so that empirical correlations must be introduced; in turbulence models, for example, the first-order model is not closed and a second-order approximation must be introduced to close it, and when the second-order model is also difficult to close, a still more complex third-order approximation is needed. One way of avoiding empirical assumptions is to introduce a physical stability condition to provide closure. As shown in Fig. 2, the algorithm in general seeks the extreme point (23) of a stability criterion (22) in a free parameter space (21) of some dimension, where the value of the criterion at each coordinate point (27) in this space is obtained by solving a system of equations (24); for example, in the energy-minimization multi-scale (EMMS) models for gas-solid and gas-liquid flow systems the numbers of free parameters are 2 and 3 respectively, and the orders of the systems of equations are 6 and 3 respectively. The criterion values at a large number of coordinate points can then be computed independently and in parallel, in single-instruction multiple-data fashion, by a large number of lower-layer computing units (25). From the results at these points, the upper-layer computing unit (26) can apply a sophisticated extremum-search algorithm, such as simulated annealing or a genetic algorithm, to produce a new batch of coordinate points (28) to be computed, and the iteration proceeds in this way towards the extreme point. Likewise, in seeking the root (29) of a nonlinear equation, a large number of still-lower-layer computing units (30) can compute the function values (31) at different arguments in single-instruction multiple-data fashion; the lower-layer computing unit (25) then determines the interval containing the root (32) and the arguments (33) to be used in the next step, and an accurate root is obtained after several iterations. When several equations have to be solved, a pipeline can also be set up between the higher-level and lower-level computing units, so that while the lower-level computing units are computing the function values of one equation in parallel, the higher-level computing unit is determining the root interval of another equation and the arguments for its next iteration.
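The batch search for the extreme point can be sketched as follows. This is only an illustration under stated assumptions: the quadratic `criterion` is a hypothetical stand-in for a stability criterion whose values would really come from solving a system of equations, and the upper-layer step uses a simple shrinking grid around the best point instead of the simulated annealing or genetic algorithms named in the text.

```python
from itertools import product

def criterion(p):
    """Hypothetical stability criterion with its minimum at (1.0, -2.0)."""
    x, y = p
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

def evaluate_batch(points):
    # Stand-in for many lower-layer computing units, each evaluating the
    # criterion at one coordinate point independently of the others.
    return [criterion(p) for p in points]

def propose_batch(centre, half_width, per_axis=5):
    # Upper-layer role: a regular grid of new coordinate points around centre.
    offsets = [half_width * (2 * k / (per_axis - 1) - 1) for k in range(per_axis)]
    return [(centre[0] + dx, centre[1] + dy) for dx, dy in product(offsets, offsets)]

def search(centre=(0.0, 0.0), half_width=4.0, rounds=30, shrink=0.6):
    for _ in range(rounds):
        points = propose_batch(centre, half_width)
        values = evaluate_batch(points)
        centre = points[values.index(min(values))]  # best point of this batch
        half_width *= shrink                        # narrow the next batch
    return centre

best = search()
```

Only `propose_batch` and the selection of the best point need a global view of the results; `evaluate_batch` is the independent, data-parallel part that dominates the cost when each criterion value requires solving a system of equations.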

This class of algorithm is also applicable to a large group of applications that follow the pattern "massive independent computation, analysis and judgement, recomputation", such as prediction of the stable structures of proteins and multiphase complex media, prediction of the returns of financial products, and graphics and image processing.
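The root-finding procedure described for Fig. 2 is a compact instance of this compute, judge, recompute pattern. In the sketch below the equation `f`, the number of sample points and the number of rounds are all assumptions chosen for illustration.

```python
import math

def f(x):
    """Hypothetical nonlinear equation f(x) = 0, with a root near 0.739."""
    return math.cos(x) - x

def refine_root(lo, hi, samples=16, rounds=10):
    for _ in range(rounds):
        # Massive independent computation: function values at many arguments.
        xs = [lo + (hi - lo) * k / samples for k in range(samples + 1)]
        ys = [f(x) for x in xs]
        # Analysis and judgement: find the sub-interval containing the root.
        for k in range(samples):
            if ys[k] == 0.0:
                return xs[k]
            if ys[k] * ys[k + 1] < 0.0:
                lo, hi = xs[k], xs[k + 1]  # arguments for the next round
                break
        else:
            if ys[-1] == 0.0:
                return xs[-1]
            raise ValueError("no sign change found in the interval")
    return 0.5 * (lo + hi)

root = refine_root(0.0, 1.0)
```

Each round shrinks the root interval by a factor equal to the number of sub-intervals, so with many units evaluating `f` at once the number of sequential rounds is much smaller than for bisection.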

3) In some cases, or for certain simulation purposes, long-range interactions cannot be neglected even though the forces between widely separated particles are small, and must be handled in some simplified way. Most such simplifications are, in essence, treatments of the collective effect of many pairwise particle interactions. Conversely, even when long-range interactions can be neglected, it may be desirable, in order to reduce computation, to compute directly only as few neighbouring interactions as possible and to account for the rest only through their collective effect. In a multi-scale discrete simulation system, therefore, neighbouring interactions between units are computed mainly by special-purpose processors in lower-layer nodes, while the collective effect of long-range interactions is computed mainly by upper-layer nodes and/or general-purpose processors.
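One common way to treat the collective effect of distant particles, used here purely as an illustration, is to replace each distant cell of particles by its total mass placed at its centre of mass. The sketch assumes a 1-D system with an inverse-square pair force, distinct particle positions, and hypothetical function names; none of these details come from the text.

```python
from collections import defaultdict

def pair_force(xi, xj, mj):
    """Inverse-square force on a unit-mass particle at xi from mass mj at xj."""
    d = xj - xi
    return mj * (1.0 if d > 0 else -1.0) / (d * d)

def forces(positions, masses, cell_width=1.0):
    cells = defaultdict(list)
    for idx, x in enumerate(positions):
        cells[int(x // cell_width)].append(idx)
    # Upper-level summary of each cell: total mass and centre of mass.
    summary = {}
    for c, members in cells.items():
        m = sum(masses[k] for k in members)
        summary[c] = (m, sum(masses[k] * positions[k] for k in members) / m)
    result = []
    for i, xi in enumerate(positions):
        ci = int(xi // cell_width)
        total = 0.0
        for c, members in cells.items():
            if abs(c - ci) <= 1:
                # Neighbouring cells: direct pairwise interactions.
                total += sum(pair_force(xi, positions[j], masses[j])
                             for j in members if j != i)
            else:
                # Distant cells: one collective interaction per cell.
                m, com = summary[c]
                total += pair_force(xi, com, m)
        result.append(total)
    return result

approx = forces([0.5, 1.2, 4.5, 7.3], [1.0, 1.0, 1.0, 1.0])
```

The direct sums correspond to the work of lower-layer special-purpose processors, and the per-cell summaries to quantities that an upper-layer unit would collect and redistribute. The approximation is exact when a distant cell holds a single particle and becomes less accurate as cells grow relative to their distance.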

This approach can also be applied to descriptive-type multi-scale methods. As shown in Fig. 3, methods at several different scales are used within the same simulation; in the mechanics of materials, for example, the deformation of the material as a whole (36) is computed by the finite element method (37), while a local material defect or crack (38) is simulated by molecular dynamics (39). Because the latter occupies a small spatial region but involves a huge amount of computation with a relatively simple algorithm, it is well suited to performing the finite element computation on the upper-layer computing unit (26) and the molecular dynamics simulation on the lower-layer computing units (25). To couple the two methods an overlap region (40) is generally needed: the statistical results of the molecular dynamics computation on the lower-layer computing units are passed to the upper-layer computing unit as state input, and the upper-layer computing unit, according to its new results, corrects and constrains (41) the motion of each molecule in the lower-layer computing units. This coupling scheme coincides exactly with the organization of computing units provided by the invention.

Likewise, this approach can be applied to correlative-type multi-scale methods, which use the statistical properties of simulation results at a small scale as parameters of a large-scale model. In a broad sense, obtaining the constitutive relations of the hydrodynamic equations from molecular dynamics, and in turn obtaining Darcy's law for flow in porous media by averaging the hydrodynamic equations, both belong to the correlative type. For simple systems, however, the statistical properties of small-scale behaviour can be expressed explicitly as correlations or charts, so that the large-scale computation in a particular case need not depend directly on a small-scale computation. For some complex systems, such as the non-Newtonian flow of certain polymers, the factors determining the material properties are numerous and closely tied to the flow state, and are difficult to express by simple correlations or charts; they must be supplied "on the spot" directly by small-scale computation, which is the typical case of coupled computation in correlative-type multi-scale methods. Here, as in the overlap region of the descriptive-type method, the lower-layer and upper-layer computing units can carry out the small-scale and large-scale computations respectively, with multi-scale coupling achieved through inter-layer statistics, transfer and distribution of parameters.

4) Numerical solution of many continuum models. A numerical solution ultimately appears as an "interaction", that is, a numerical dependence, among a set of discrete grid points, and in many cases, chiefly under explicit schemes, a grid with such dependences can be regarded as a special system of stationary particles. More significantly, although explicit numerical methods are simple and parallelize well, they have traditionally been used less widely than implicit schemes because of poorer stability and/or accuracy, and variational multi-scale simulation offers a possible way of addressing this.

A continuum model itself has infinitely many degrees of freedom, whereas the numerical model built from it has only a finite number, namely the product of the number of discrete grid points, the number of variables defined at each grid point and the number of computation steps. The system of equations solved numerically is therefore not a closed description of the physical problem, although it is itself mathematically closed. Round-off errors caused by finite machine word length, and errors caused by the finite grid spacing and step size, also make the simulation results deviate from the actual state, particularly when the computation is large in scale and long in duration. If the corresponding physical model truly reflects the real process, however, then under the same initial conditions it should give a coarse-grained, approximate description of the real process over a short time. If, in addition, a stability condition governing the evolution of the real process can be found, it can serve as a variational constraint on this physically unclosed numerical model, limiting and suppressing numerical errors at the level of mechanism and bringing the model closer to the real process. Two examples follow:

A) The entropy of any isolated system never decreases. If the simulation of such a system yields a result in which the system's entropy decreases, this principle can be used to correct the result by some rule so that at least no decrease in entropy occurs, giving a reasonable approximate solution.

B) As shown in Fig. 4, a linear non-equilibrium process satisfies, under certain conditions, the stability condition of minimum entropy production. If the numerical model, starting from the initial state (45), can provide an approximate solution (47) sufficiently close to the true solution (46), a number of trial solutions (48) can be generated within a small range around this approximate solution and their entropy production rates computed; the one with the smallest rate can be regarded as a better approximate solution (49).
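Example B can be sketched with a toy problem. The sketch assumes a 1-D steady profile with fixed boundary values and uses the sum of squared differences between adjacent grid values as a simple proxy for the entropy production rate; the proxy, the perturbation radius and the function names are illustrative assumptions, not the formulation of the invention.

```python
import random

def dissipation(profile):
    """Proxy for the entropy production rate: sum of squared differences
    between adjacent grid values (a discrete gradient-squared functional)."""
    return sum((b - a) ** 2 for a, b in zip(profile, profile[1:]))

def improve(profile, trials=64, radius=0.05, seed=0):
    rng = random.Random(seed)
    candidates = [list(profile)]  # keep the current approximate solution
    for _ in range(trials):
        trial = list(profile)
        for i in range(1, len(trial) - 1):  # boundary values stay fixed
            trial[i] += rng.uniform(-radius, radius)
        candidates.append(trial)
    # The trial solutions are independent, so their criterion values could
    # be computed in parallel on lower-level computing units.
    scores = [dissipation(c) for c in candidates]
    return candidates[scores.index(min(scores))]

approximate = [0.0, 0.3, 0.4, 0.8, 1.0]
better = improve(approximate)
```

Because the current solution is kept among the candidates, the selected solution never has a larger criterion value than the starting one; for this particular proxy with fixed end values, the minimizer is the linear profile, so repeated application moves the profile towards it.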

In the above procedures, the original approximate solution can be computed rapidly in parallel on a large number of lower-level computing units, whereas computing the stability-condition indices (such as entropy and entropy production) and applying the corrections involve global reduction and exchange and must be carried out in cooperation with higher-level computing units.

Two) Hardware design example

The variational multi-scale simulation computing hardware system as claimed in claim 9 can be realized at different hardware levels, such as groups of components, chips, boards, and server or workstation nodes. Its overall structure is shown in Fig. 5, where the lower-layer computing units (25) and upper-layer computing units (26) illustrate a two-layer structure; in general, more layers of computing units are allowed. The higher-level computing units are generally more complex and concentrate on "decision and organization" tasks, such as computing and feeding back extremum and variational constraints; middle-layer computing units are relatively simple and concentrate on "implementation and coordination" tasks, mainly computing complex inter-particle interactions and the effects of long-range interactions; and the main task of the lower-level computing units is to "execute" simple computing tasks assigned by upper units, such as simple inter-particle interactions. A concrete scheme realized at the node level is introduced below, together with some variants or alternative techniques for its parts. This scheme can essentially be realized by assembling and integrating existing computer devices and components; concrete schemes at other hardware levels have similar parts but require more specialized engineering design.

1) Computing nodes

The computing node is the main body and principal functional carrier of the system, and is the concrete form taken by the computing unit of the invention. At the node level, commercial personal computers, workstations and servers (including rack-mounted and blade servers) can be used. Such nodes can generally be fitted with 1, 2 or 4 central processing units (CPUs) of x86 architecture, which serve as the general-purpose processors of a node in the invention. These nodes also generally have several expansion slots, typically 5 to 7, conforming to the Peripheral Component Interconnect (PCI) peripheral bus standard in its various versions, such as PCI, PCI-X and PCI-E. Network cards and computing accelerator cards can be plugged into these expansion slots to provide, respectively, the network interface units required for communication connections and the special-purpose processors of the scheme of the invention.

Many compute accelerator cards carrying general-purpose graphics processing units (GPGPU) have appeared on the market, and many graphics cards with a graphics processing unit (GPU) can in fact also be used as compute accelerator cards. These products mainly include:

Nvidia:

Various GeForce graphics cards produced under license,

such as the GeForce 8800GTX, 9800GX2 and GTX280;

Quadro professional graphics cards, such as the FX5600;

Professional compute accelerator cards, such as the Tesla C870 and C1060;

Standalone compute accelerator modules, such as the Tesla S870, S1070 and S1075.

ATI (acquired by AMD):

Graphics cards such as the Radeon HD3870, HD4870 and HD4870X2;

Professional compute accelerator cards such as the FireStream 9250.

Intel: the x86-like many-core computing architecture Larrabee, currently under development.

In addition, IBM has developed the Cell Broadband Engine series blade servers QS20, QS21 and QS22, which combine general-purpose and special-purpose processors.

For this computing hardware, each company has also developed corresponding programming environments and tools, such as Nvidia's CUDA (Compute Unified Device Architecture), AMD/ATI's CAL (Compute Abstraction Layer), IBM's Cell BE SDK and Intel's Ct, as well as third-party software such as Brook. They can be used together with the C and C++ compilers and development environments commonly used for general-purpose processors, and with parallel computing standards such as MPI (Message Passing Interface) and OpenMP.

If compute accelerator cards or graphics cards are plugged directly into the expansion slots of a node, Nvidia's GTX280 can be chosen; it has 240 stream processors, an operating frequency of 1.35 GHz, a peak speed of 0.933 Tflops (single precision) and 1 GB of video memory. Alternatively the C1060 can be chosen, which also has 240 stream processors, with an operating frequency of 1.5 GHz, a peak speed of 1.08 Tflops (single precision) and 4 GB of video memory. The professional compute accelerator card FireStream 9250 from AMD/ATI can also be chosen; it has 800 stream processors, but its video memory is smaller, only 1 GB. Other options are the general-purpose graphics cards HD4870 and HD4870X2, which have 800 and 1600 stream processors respectively, an operating frequency of 750 MHz and 1 GB of video memory, and reach peak speeds of 1.2 and 2.4 Tflops (single precision), or 0.24 and 0.48 Tflops (double precision), respectively.
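The quoted single-precision peak speeds can be related to the stream processor counts and clock frequencies by a simple product. The sketch below is a rough check under an assumption not stated in the text: each stream processor is credited with 3 floating-point operations per cycle on the Nvidia card (a multiply-add plus a multiply) and 2 per cycle on the AMD/ATI cards (a multiply-add). The function name `peak_tflops` is hypothetical, and only cards whose quoted figures are reproduced under these assumptions are shown.

```python
def peak_tflops(stream_processors: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak = processors x clock x operations per cycle (in Tflops)."""
    return stream_processors * clock_ghz * flops_per_cycle / 1000.0


# flops_per_cycle values are assumptions, not figures taken from the text.
c1060 = peak_tflops(240, 1.5, 3)       # quoted as 1.08 Tflops
hd4870 = peak_tflops(800, 0.75, 2)     # quoted as 1.2 Tflops
hd4870x2 = peak_tflops(1600, 0.75, 2)  # quoted as 2.4 Tflops
```

These are theoretical peaks; sustained performance in a particle simulation depends on memory bandwidth and on how well the interaction kernel maps onto the stream processors.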

Depending on the computing requirements and the expansion slot configuration of the motherboard, 2 to 4 of the above compute accelerator cards or graphics cards can generally be installed; most of these products are best plugged into PCI-E x16 Gen2 slots. Because most current products also work best with one CPU core (thread) controlling each GPU, a node can be configured with one to four quad-core CPUs, such as Intel's latest-generation quad-core processor Harpertown, manufactured with a 45 nm process and clocked at 3.0 GHz or above. In this way, while the CPUs exchange data with and monitor the GPUs, sufficient resources remain to exchange data with other nodes and to take on part of the upper-layer computing tasks. A node should have a large memory, so about 8 DIMM slots are needed to hold up to 64 GB of DDR memory with ECC. This gives a good speed match when data is exchanged between nodes only after several reads and writes between video memory and main memory, and hides communication time. Depending on the computation, 2 to 4 SAS or SATA hard disks of 100 GB or more each, configured as RAID 1, may be required. The node also carries onboard dual-port interfaces and InfiniBand or gigabit Ethernet interface cards in PCI-Express and PCI-E x8 slots, as well as display, keyboard and mouse interfaces. Because of the size and power constraints of the accelerator cards and graphics cards, such a node generally needs to be designed as an industry-standard rack-mounted server 3U to 4U high, with some room for reduction in width and depth. Figure ? is a schematic diagram of this type of compute node.

To further increase packing density and improve electromagnetic shielding and heat dissipation, standalone compute accelerator modules can be used, each with its own chassis, power supply and cooling and containing several compute accelerator cards; one or more such modules together with one host form a compute node. Nvidia's newly released Tesla S1070 compute accelerator module contains four professional compute accelerator cards, each with one GT200-series GPU. This GPU is made with a 65 nm integrated circuit process, has a 512-bit memory interface and a memory bandwidth of 142 GB/s, has 240 thread processors clocked at up to 1.5 GHz, exceeds 1 Tflops peak speed and has up to 4 GB of video memory. The whole module can therefore provide more than 4 Tflops of single-precision computing power and 16 GB of GDDR3 video memory within the height of a standard 1U chassis, with a typical total power consumption of only about 700 W. As shown in Figure 6, the four professional compute cards (61) of a compute accelerator module (60) are divided into two groups and connected through PCIe x16 slots (62) to two switching daughter cards (63), and these two daughter cards are each connected to an interface card (64) in a PCIe x16 Gen2 slot of the host. Thanks to Nvidia's patented design, the switching in the daughter cards has essentially no effect on transfer performance. The corresponding host (65) can be a customized industry-standard rack server. Except that its expansion slots hold the smaller switching cards rather than compute accelerator cards or graphics cards, its configuration can be essentially the same as that of the node described above, but its size can be greatly reduced, typically to a height of 1U.

Depending on the position of a node in the parallel computing system proposed by the present invention, it can adopt different configurations among the possibilities introduced above. When serving as a higher-level computing unit, it can be given stronger general-purpose computing capability, for example four high-frequency quad-core CPUs, a larger memory of 16 GB or more and a larger hard disk of 500 GB or more, and need not be fitted with acceleration components such as GPUs. When serving as a lower-level computing unit, it can be given weaker general-purpose computing capability, for example a single low-frequency quad-core CPU, about 4 GB of memory and a hard disk of about 50 GB, but can be fitted with more powerful acceleration components, such as 4 or even more than 8 GPUs. The specific differences need to be adjusted flexibly according to the market and the application at the time, but the basic principle is determined by the overall system structure provided by the present invention.

All of the above schemes are based on compute accelerator cards and graphics cards available on the market. The technical scheme of the present invention, however, also covers schemes that use special motherboard designs to combine GPU and CPU more closely on the same motherboard or even on the same chip. In other words, feasible node designs include the various ways of coupling single-instruction multiple-data (SIMD) and multiple-instruction multiple-data (MIMD) processing.

2) Input/output and pre- and post-processing nodes

To form a complete computer system, high-performance data input/output nodes or node groups are also needed, with the capability to pre- and post-process data; combined input/output and pre- and post-processing nodes are provided for this purpose. On one side they are interconnected with the compute nodes through the tree network, and on the other side they are interconnected with the storage units through a storage area network. To meet post-processing demands, as shown in Figure 7, each node (71) can use three GeForce GTX280 graphics cards (72) made by Nvidia-authorized OEMs; each card can be plugged directly into a PCI-E x16 expansion slot and occupies two full-height, full-length slot positions. This card also uses the GT200 GPU, with an operating frequency of 1.33 GHz, a corresponding peak speed of 933 Gflops, 1 GB of video memory and a peak power of 160 W. Its advantage is that, besides performing floating-point operations, it also provides normal text and graphics display functions. Each node is planned as a customized 4U rack-mounted server. A 4U node offers more comprehensive advantages: it can support high-speed interfaces such as PCI-E and InfiniBand, and it will be equipped with large memory and hard disk capacity and an external storage adapter (73) for connection to the storage disk array (74). Each GTX280 card can also drive two large displays (75) (30-inch LCDs with 2560x1600 resolution), so that a single user terminal built this way forms a display array of more than 24 million pixels, and 4 to 6 user terminals working in parallel can provide a dynamic display capability of more than 100 million pixels, meeting the demands of high-resolution processing of massive data. The general structure of these nodes is shown in Figure ?.
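The pixel counts quoted for the display array follow from the card and display numbers given above. The short Python sketch below reproduces the arithmetic; the function names `terminal_pixels` and `terminals_needed` are hypothetical.

```python
PIXELS_PER_DISPLAY = 2560 * 1600  # one 30-inch display
DISPLAYS_PER_CARD = 2             # two displays per GTX280 card
CARDS_PER_NODE = 3                # three cards per node


def terminal_pixels(cards: int = CARDS_PER_NODE, displays: int = DISPLAYS_PER_CARD) -> int:
    """Total pixels driven by one user terminal."""
    return cards * displays * PIXELS_PER_DISPLAY


def terminals_needed(target_pixels: int) -> int:
    """Smallest number of terminals whose combined pixel count reaches the target."""
    per_terminal = terminal_pixels()
    return -(-target_pixels // per_terminal)  # ceiling division on integers
```

With these figures one terminal drives 24,576,000 pixels. Four terminals give about 98 million pixels, so by this count the 100-million-pixel level is reached from five terminals upward.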

3) Communication network

The network is the basis on which the nodes of the particle simulation cluster interconnect and work together. The high-speed interconnects used by cluster systems currently on the TOP500 list mainly include gigabit Ethernet, 10G Ethernet, Quadrics, Myrinet, InfiniBand, PathScale's InfiniPath, IBM's SP network, Dolphin SCI, SGI's NUMAlink and Cray's RapidArray. Three of these, gigabit Ethernet, Myrinet and InfiniBand, are widely used in Linux clusters, and all three generally use switches to connect the nodes.

Gigabit Ethernet is currently the most widely used network; it can be used for Linux cluster management as well as for data exchange and communication between the nodes of a Linux cluster. Although its bandwidth and latency are not as good as those of Myrinet and InfiniBand, it is easy to build and meets most application demands at lower cost, and it still accounts for more than half of the high-performance computing systems on the TOP500 list at present.

The design goal of Myrinet is to obtain system-area-network performance in a LAN environment. It therefore adopts the packet communication and switching techniques of MPP systems. Its design takes full account of the short transmission distances and low error rates of the internal interconnect of a parallel system and uses a simplified link control protocol for data transfer, which reduces protocol overhead. It adopts a non-blocking Clos network topology, which reduces packet contention in the network. The automatic mapping and routing of the network performed by the network adapter processor improves network reliability.

InfiniBand is a switched-fabric I/O technology. Its design idea is to establish a single connection link among devices such as remote storage, networks and servers through a central mechanism (a central InfiniBand switch), and to direct traffic through that central switch. Its structure is tightly designed, which greatly improves the performance, reliability and effectiveness of the system and relieves data traffic congestion between hardware devices.

In Chinese patent applications 200710099551.8 and 200810057259.4, it was suggested that the proposed multi-layer direct-connection network structure be implemented through mesh or cubic connections of gigabit Ethernet cards (4 or 6 gigabit Ethernet ports are installed on each computer, and the ports of neighboring nodes are connected directly, without a switch). The main consideration is that in discrete simulation a message often contains the parameters of hundreds of particles, so the amount of data per transfer is large, network latency has little effect on the computation and network bandwidth matters more. Moreover, parallelization by spatial domain decomposition with the shift communication pattern (only neighboring nodes exchange data, and very little global communication is needed) is the prevailing approach in large-scale discrete simulation. Although this direct-connection mode requires a good correspondence between the physical topology of the nodes and the process topology, and the division of tasks among nodes is less flexible, the communication pattern is simple, the communication load is easy to balance, the system can be expanded without limit and the total communication overhead is actually lower; in particular, for large-scale computations in which the whole system solves a single simulation problem, the best price/performance ratio can be obtained.
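The shift communication pattern on a mesh or cubic arrangement can be sketched as a lookup of each node's direct neighbors. The Python sketch below is illustrative only: the function name `shift_neighbors` is hypothetical, and periodic (wrap-around) boundaries are an assumption made to keep the example short.

```python
from typing import Dict, Tuple

Coord = Tuple[int, ...]


def shift_neighbors(coord: Coord, dims: Coord) -> Dict[Tuple[int, int], Coord]:
    """Return the direct neighbours of a node, keyed by (axis, direction).

    Boundaries are treated as periodic, which is an assumption of this sketch.
    """
    neighbors: Dict[Tuple[int, int], Coord] = {}
    for axis, size in enumerate(dims):
        for direction in (-1, 1):
            shifted = list(coord)
            shifted[axis] = (coord[axis] + direction) % size
            neighbors[(axis, direction)] = tuple(shifted)
    return neighbors


mesh_links = shift_neighbors((0, 0), (4, 4))        # two-dimensional mesh
cube_links = shift_neighbors((1, 2, 3), (4, 4, 4))  # three-dimensional arrangement
```

A node in the two-dimensional case has 4 such links and a node in the three-dimensional case has 6, which corresponds to the 4 or 6 Ethernet ports per computer mentioned above.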

However, for building larger systems, or where higher reliability, fault tolerance and efficiency are required, higher-performance expandable short-range connections and multiscale tree networks based on Myrinet or InfiniBand are also feasible. For this purpose, two classes of schemes can be used to connect the compute nodes of the parallel computing system of the present invention according to claim 2; both can of course still be implemented with Ethernet:

A) As shown in Figure 8, in a given layer of the computing unit array, the computing units (30) of one region are connected to one switching mechanism (81), and the switching mechanisms of the regions likewise form an array (with far fewer switching mechanisms than computing units). Adjacent switching mechanisms are interconnected through one or more ports (82), forming a mesh or cubic network on the switching mechanism array. Because the switching mechanism array is extensible, the computing unit array connected to it is also extensible, and any two adjacent computing units in the array can communicate through at most two switching mechanisms.
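The "at most two switching mechanisms" property of scheme (a) can be checked on a small model. The Python sketch below assumes a two-dimensional array, square regions of side `region`, and adjacency along rows and columns only; the function names are hypothetical.

```python
from typing import Tuple

Unit = Tuple[int, int]


def switch_of(unit: Unit, region: int) -> Tuple[int, int]:
    """Index of the switching mechanism serving the region that contains the unit."""
    x, y = unit
    return (x // region, y // region)


def switches_on_path(a: Unit, b: Unit, region: int) -> int:
    """Switching mechanisms traversed between two units, routing over the switch mesh."""
    sa, sb = switch_of(a, region), switch_of(b, region)
    hops = abs(sa[0] - sb[0]) + abs(sa[1] - sb[1])  # links between adjacent switches
    return hops + 1


def max_switches_between_neighbors(width: int, height: int, region: int) -> int:
    """Worst case over all row- or column-adjacent pairs of units."""
    worst = 0
    for x in range(width):
        for y in range(height):
            for dx, dy in ((1, 0), (0, 1)):
                nx, ny = x + dx, y + dy
                if nx < width and ny < height:
                    worst = max(worst, switches_on_path((x, y), (nx, ny), region))
    return worst
```

For an 8 x 8 array with regions of side 4, pairs inside one region use one switching mechanism and pairs that straddle a region boundary use two, so the worst case is two.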

B) As shown in Figure 9, in a given layer of the computing unit array, the computing units of one region (91) are connected to one switching mechanism (92), and the computing units of another, partly overlapping region (93) are connected to another switching mechanism (94). In this way any pair of adjacent computing units in the array always has a common directly connected switching mechanism, so that communication passes through at most one switching mechanism. As shown in Fig. 9a, one possible arrangement for a two- or three-dimensional orthogonal computing unit array is to divide the array into a set of rectangular or cuboid regions (region 91 and the like), which themselves form a two- or three-dimensional array, with one switching mechanism per region (switching mechanism 92 and the like). At the same time the computing unit array is also divided into a second array of rectangular or cuboid regions (region 93 and the like), offset from the first by half a region in the row, column (and layer) directions, and the computing units of each of these regions are likewise connected to one switching mechanism (switching mechanism 94 and the like). Every computing unit is thus connected to two switching mechanisms. A pair of computing units that does not straddle any region boundary then has two communication connections, each passing through only one switching mechanism, which can work simultaneously or back each other up. If the number of ports allows, the switching mechanism array can also establish neighbor connections (95) as in scheme (a), so that any pair of neighboring computing units has two communication connections, each passing through at most two switching mechanisms. This further improves communication speed and reliability, at the cost, of course, of more switching mechanisms and network interface units. For this reason, as shown in Fig. 9b, when one switching mechanism can connect a large number of computing units, only the computing units on region boundaries are connected to several switching mechanisms, so that most computing units need only one switching mechanism and one network interface unit. And as shown in Fig. 9c, if all computing units on the boundaries are connected to one switching mechanism, then computing units adjacent along any diagonal can also establish a communication connection through one switching mechanism.
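The half-region offset of Fig. 9a can be modeled in a few lines. The Python sketch below assumes a two-dimensional array with square regions of even side `region` and open edges; the labels "A" and "B" for the two region arrays and the function names are hypothetical.

```python
from typing import Set, Tuple

Unit = Tuple[int, int]
Switch = Tuple[str, int, int]


def switches_of(unit: Unit, region: int) -> Set[Switch]:
    """The two switching mechanisms a unit connects to: one per region array."""
    x, y = unit
    half = region // 2
    primary = ("A", x // region, y // region)                   # first region array
    shifted = ("B", (x + half) // region, (y + half) // region)  # offset by half a region
    return {primary, shifted}


def common_switches(a: Unit, b: Unit, region: int) -> Set[Switch]:
    """Switching mechanisms directly connected to both units."""
    return switches_of(a, region) & switches_of(b, region)


def every_neighbor_pair_shares_switch(width: int, height: int, region: int) -> bool:
    """Check all row- or column-adjacent pairs for a common switching mechanism."""
    for x in range(width):
        for y in range(height):
            for dx, dy in ((1, 0), (0, 1)):
                nx, ny = x + dx, y + dy
                if nx < width and ny < height:
                    if not common_switches((x, y), (nx, ny), region):
                        return False
    return True
```

In this model a pair that crosses no region boundary shares two switching mechanisms, and a pair that crosses a boundary of one region array still shares one from the other array. Some diagonally adjacent pairs share none, which is the case that the arrangement of Fig. 9c addresses.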

In the above schemes, the upper-layer computing units are also connected to the switching mechanisms to which the lower-layer computing units are connected, and the upper-layer computing units can themselves be connected in the manner described above; proceeding by analogy forms a multi-layer global network, that is, a tree network, which carries non-neighbor communication and instructions. Of course, both schemes could also be implemented with the multi-layer direct-connection network proposed in Chinese patent application 200710099551.8, but the above schemes improve the reliability and fault tolerance of the system while keeping unlimited scalability (that is, the communication speed between adjacent computing units is not correspondingly affected as the number of computing units grows): when one computing unit fails, another computing unit connected to the same switching mechanism can readily take over its tasks. The delay introduced by passing through one switching mechanism is acceptable for Myrinet and InfiniBand. If gigabit Ethernet is used, a simplified message passing interface can be adopted to raise the transfer rate and reduce latency, shielding communication operations and services that are irrelevant or unnecessary for direct communication, so as to maximize actual performance; examples are lightweight protocols such as EMP, PM/Ethernet, M-VIA and GAMMA.
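The takeover of a failed computing unit's tasks by units on the same switching mechanism can be sketched as follows. The text does not specify a reassignment policy; the least-loaded rule used here, the function name `reassign_on_failure` and the task labels are assumptions made for illustration.

```python
from typing import Dict, List


def reassign_on_failure(
    tasks: Dict[str, List[str]], switch_members: List[str], failed: str
) -> Dict[str, List[str]]:
    """Move the tasks of a failed unit to surviving units on the same switch.

    Each task goes to the currently least-loaded survivor (an assumed policy).
    """
    survivors = [u for u in switch_members if u != failed]
    if not survivors:
        raise RuntimeError("no surviving unit on this switching mechanism")
    updated = {u: list(tasks.get(u, [])) for u in survivors}
    for task in tasks.get(failed, []):
        target = min(survivors, key=lambda u: len(updated[u]))
        updated[target].append(task)
    return updated


before = {"u0": ["a", "b"], "u1": ["c"], "u2": ["d", "e", "f"]}
after = reassign_on_failure(before, ["u0", "u1", "u2"], failed="u2")
```

Because the survivors already share a switching mechanism with the failed unit, the reassigned tasks keep the same single-switch path to their former neighbors in this sketch.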

Besides the data network, which carries heavier traffic, it is preferable, for system efficiency, stability and ease of use, to have an independent management network connecting all computing units, input/output and pre- and post-processing units, and management and login units. Because its traffic is relatively light and occurs mainly in specific periods, but involves more complex communication patterns, ordinary gigabit Ethernet technology can still be used. This network can be configured with several switching mechanisms, interconnected at high speed through multiple ports.

Other parts of the system design, such as management and monitoring systems, are essentially based on existing technology and are not described further. It is worth pointing out that, although the hardware organization proposed here is more systematic and complete, the local shared memory and multi-layer direct-connection techniques proposed in Chinese invention patent applications 200510064799.1, 200710099551.8 and 200810057259.4 can also be used to build corresponding computing systems at different levels and to run the algorithm proposed by the present invention. Much follow-up development is also possible for higher efficiency and reliability, for example using InfiniBand to replace the gigabit Ethernet used for direct connections between nodes in the prior art, or introducing interconnections between switches of the same layer into the tree network designs of Chinese invention patent applications 200510064799.1 and 200710099551.8, thereby coupling the present invention with those two inventions to achieve more flexible and reliable network connections.

The specific embodiments described above further explain the purpose, technical scheme and beneficial effects of the present invention. It should be understood that the above is only a description of some typical implementations of the claims of the present invention and is not intended to limit the present invention. All other implementations proposed by those skilled in the art within the spirit and principles of the present invention, such as those using different communication software and hardware or different node configurations, and any modifications, equivalent replacements and improvements made, shall fall within the scope of protection of the present invention.

Claims

1. A general algorithm based on the variational multiscale method, in which the simulated system is discretized into interrelated model units at different levels; relatively simple short-range interactions occur between lower-level model units, and higher-level model units constrain and correct the motion of lower-level model units.

2. The algorithm according to claim 1, characterized in that each model unit is described by a specific set of variables, its state variables; a specific set of values of the state variables of a model unit is a state of that model unit, and the state of a model unit changes through the interactions between model units.

3. The algorithm according to claim 1, characterized in that the constraint imposed by a higher-level model unit on lower-level model units is an extremum or variational constraint, that is, the associated lower-level model units collectively satisfy extremum or variational conditions on one or more variables or functions determined by the states of these model units, and when several extremum or variational conditions exist they act as constraints on one another.

4. The algorithm according to claim 1, characterized in that the interactions between lower-level model units are parallel, that is, the interactions between any two lower-level model units are processed simultaneously, and the contribution of another model unit to the state change of the model unit being processed is calculated from the state of the processed model unit and part or all of the state information of the other model unit acting on it.

5. The algorithm according to claim 1, characterized in that the interactions between same-layer model units at the lower level are superposable, that is, the total contribution of the same-layer model units to the state change of a processed model unit is a function of the individual contributions of these model units to the state change of the processed model unit.

6. The algorithm according to claim 1, characterized in that the interactions between same-layer model units at the lower level are short-range, that is, each lower-level model unit interacts only with a number of same-layer model units not exceeding a specific upper limit, and this upper limit on the number of neighbor model units of each model unit does not grow with the number of same-layer model units; in addition, the neighbor model units of each model unit are themselves neighbors of one another up to a limited order, and the upper limit of this order does not grow with the number of same-layer model units.

7. The algorithm according to claim 1, characterized in that the interaction between lower-level model units and a higher-level model unit is a feedback-constraint mechanism, that is, each lower-level model unit directly connected to a higher-level model unit transmits part or all of its state information to that higher-level model unit; the total contribution of the lower-level model units to the state change of the higher-level model unit is a function of this information; and the contribution of the higher-level model unit to the state change of each lower-level model unit is a function of the state of the higher-level model unit and its change.

8. The algorithm according to claim 1, characterized in that the change of model unit states is progressive, that is, the new state of each model unit is determined by the total contribution of the same-layer model units to its state change and by the contributions of its lower-level and higher-level model units to its state change, and the new state of each model unit in turn determines its next updated state in the same way.

9. A parallel computing system for the algorithm according to claim 1, comprising computing units at different levels, each of which can exchange or share information with neighboring computing units of the same layer within a certain range or with certain corresponding computing units of adjacent layers; the computing units of lower levels are more numerous and have simpler logic circuits, capable of relatively simple calculations, whereas the computing units of higher levels are fewer and have more complex logic circuits, capable of complex calculations.

10. The parallel computing system according to claim 9, characterized in that its composition includes at least the following modes:

A) the system is composed of a plurality of computing unit groups, each containing one communication switching mechanism and a plurality of computing units directly connected to it; the communication switching mechanisms are arranged in one or more layers of one- or multi-dimensional arrays, with direct communication connections between adjacent switching mechanisms of the same layer, and a plurality of lower-layer switching mechanisms also have direct communication connections with a specific upper-layer switching mechanism;

B) the system is composed of one or more layers of one- or multi-dimensional computing unit arrays, each computing unit having direct communication connections with two or more switching mechanisms; between any adjacent computing units of the same layer in an array there is at least one indirect connection through only one switching mechanism, and between each lower-layer computing unit and a specific computing unit of the layer above there is at least one indirect connection through only one switching mechanism.

11. The parallel computing system according to claim 9, characterized in that the computing unit is a logical structural unit with independent computing and communication functions, comprising any one of, or any combination of, a chip, a chipset, a programmable gate array chip, a board and a stand-alone computer; and the lower-level structures of these chips, chipsets, programmable gate arrays, boards and stand-alone computers can themselves adopt the organization described in claim 9.

12. The parallel computing system according to claim 9, characterized in that the higher-level computing units contain a small number of general-purpose processors capable of handling various complex algorithms, whereas the lower-level computing units contain a large number of special-purpose processors suited only to certain classes of problems, including single-instruction multiple-data stream processors; a special-purpose processor may have no memory or share a small amount of memory, but can pass its results to a general-purpose processor; and some middle-layer computing units may contain both kinds of processors.

13. The parallel computing system according to claim 9, characterized in that the array is an arbitrarily extensible array, or an array formed by any repeatable arrangement, including at least arrays formed of rectangles or cuboids, triangles or tetrahedra, and hexagons or tetrakaidecahedra; the edges of the array are open, or are connected to the corresponding opposite edges.

14. The parallel computing system according to claim 9, characterized in that the communication connection is any connection suitable for use between the computing units, including at least a communication bus, a crossbar switch, network cards and network cabling, serial or parallel ports and their cables, and USB ports and their cables.

15. The parallel computing system according to claim 9, characterized in that the information sharing is any information sharing mode suitable for use between the computing units, including at least shared main memory, shared video memory, shared cache and shared registers.

16. The parallel computing system according to claim 9, characterized in that the communication switching mechanism is a multi-input single-output or multi-input multi-output communication switching mechanism supporting any of the said communication connections.

17. The parallel computing system according to claim 9, characterized in that, when the general algorithm according to claim 1 is executed on the system, the topological relations between model units in the algorithm correspond to the topological relations between computing units in the parallel system, that is, the state variables of one or more lower-level model units are stored in one lower-level computing unit, and the state variables of model units that are neighbors are stored in computing units that are neighbors or in the same computing unit; the state variables of the higher-level model units corresponding to these model units are stored in the higher-level computing unit corresponding to the lower-level computing unit; and the main calculation of the state change of a model unit is carried out in the computing unit that stores its state variables, the results of lower-layer computing units being sent directly to the higher-level computing unit and used immediately.