WO2015044696A2

WO2015044696A2 - Computer architecture and processing method

Info

Publication number: WO2015044696A2
Application number: PCT/HU2014/000086
Authority: WO
Inventors: ÁDám RÁK; György Gábor CSEREY
Original assignee: Pázmány Péter Katolikus Egyetem
Priority date: 2013-09-27
Filing date: 2014-09-25
Publication date: 2015-04-02
Also published as: HUP1300561A2; WO2015044696A3; WO2015044696A8

Abstract

The invention is a computer architecture comprising a central processing device adapted for processing a data stream (40) consisting of data elements and comprising an instruction array and a data array. Said central processing device comprising at least one array of processing units (30, 32) connected to each other and being capable of executing an operation on the data array based on the instruction array, and data transferring elements connected to the outermost processing units (30, 32) of the at least one array of the processing units (30, 32) and adapted for transferring the data stream (40). The computer architecture further comprises at least one data storage device connected to the central processing device and an instruction definition device adapted for defining an instruction array implementing a computer program on the architecture and determining the traverse route (31) of the data stream (40), said instruction definition device is connected to the central processing device, and the data storage device comprises a storage unit adapted for storing the data stream (40), and a sorting unit adapted for reordering the data elements. The invention is furthermore a processing method.

Description

COMPUTER ARCHITECTURE AND PROCESSING METHOD

TECHNICAL FIELD

The invention relates to a computer architecture adapted for processing a computer program, having a central processing device comprising at least one array of processing units, as well as to a processing method.

BACKGROUND ART

In the past decade the development of processors has taken a new turn; the number of processing units integrated on a single chip has been growing exponentially, and predictions show that this trend continues in the next decade. There are a number of known multiprocessor or kilo-processor architectures, of which the most important are CUDA (Compute Unified Device Architecture), FPGA (Field Programmable Gate Array), and the integrated circuit chips of the video cards made by AMD.

Multiprocessor devices, including GPUs (Graphics Processing Units) are presently most widely applied in personal computer video cards for which the market of video games produces increasing demand. There is demand for such 3D games and programs which require powerful single-purpose hardware capable of performing a mass of identical operations on very large amounts of data. This demand has catalysed the development of GPUs which at the same time are perfectly suitable for more general tasks requiring high computing power. The structures applied for 3D visualisation procedures are similar to scientific or industrial physics simulations. Having recognised the market demand, manufacturers provide more and more general access to their devices, which has increased the prevalency of GPUs which are easily available at relatively low prices. With this possibility at hand, the (almost) real-time implementation of computation-intensive algorithms has come within reach.

There is continuous demand for further increasing the speed of the computer architectures, and also for reducing their power consumption while at the same time increasing the number of processors. The currently available, widely applied GPU designs comprise a large number of identically configured processors, processing units which access the data to be processed via a hierarchically organised memory system. In GPUs the processing unit are organised in groups and are capable of communicating locally within each group. In GPUs there is no full locality, their memory system is organised in a so-called tree structure. For a fully local organisation of identical processing units more than one global lines should be routed to each processing unit, which, in spite of the local-only interconnections between the processing units, limits the maximal size of the device, such as in case of the implementation of CNNs (Cellular Neural/Nonlinear Networks). At the level of transistors the sizes are so small (the relative size of the chip is so large) that communication between two outermost located processors on the chip takes too much time.

A further difficulty with GPU-based devices is related to how data are conveyed to the processing units. The measure of optimal utilisation is the number of operations performed on a piece of data such that the computing power of the architecture is utilised to its full capacity. In case memory is utilised optimally, this usually amounts to 25-30 operations. In case of GPUs, this number is so small because processing is performed overlapped with data transfer, but - since algorithms typically do not perform 25-30 operations on a piece of data before writing it out to memory - it is usually still too large for providing optimal architecture utilisation. In case the device does not perform that many operations on each data piece, then - due to the time required for memory access - the processing units become "starved", meaning that they are not active for a considerable part of time.

In addition to the GPU, other parallel architectures with significantly wide application are the above mentioned FPGA, the similar FPOA (Field Programmable Object Array), and the so-called systolic array. In the following, these known solutions are compared to each other and to a conventional CPU (Central Processing Unit) as far as their memory and computer architectures are concerned.

Generally, CPUs comprise a single memory, and apply general caching. Generally meant under caching that data are stored such that they can be made available for further use faster than it would be possible applying conventional data storing. The so-called cache memory includes data elements selected in an expedient manner and is configured such that it can make the stored data elements available for the CPU as quickly as possible. CPUs are capable of concurrently/simultaneously running a relatively small number of threads on a small number of arithmetic units but utilising these arithmetic units very effectively. In multi-core CPUs each core may have cache memory dedicated to it.

The computer architecture of CPUs is characterised by the so-called "out of order execution", i.e. if necessary, instructions are reordered by the CPU at run-time. A significant part of the CPU chip is occupied by the cache memory, with the number of transistors belonging to logic circuits performing actual arithmetic operations being smaller than other logic circuits.

The GPU memory architecture comprises addressable on-chip local memory dedicated to each core or a group of cores. Global memory is also connected to GPUs which also apply caching, albeit to a lesser amount than what is applied for CPUs. GPUs are typically capable of so-called "two-dimensional caching", i.e. of handling memory as two-dimensional arrays.

The computation architecture of GPUs is characterised by a very large number of simple cores which are usually SIMD (Single Instruction, Multiple Data) type vector processors. The majority of transistors of a GPU are processing units, most frequently a combination of an ALU (Arithmetic Logic Unit) and an FPU (Floating- Point Unit). GPUs are characterised by very simple pipeline management and deep pipelines. Pipelines are routes leading through the data processing components of the computer architecture. By "deep pipeline" it is meant that the sequence of operations to be processed is long. Compared to CPUs, therefore, GPUs comprise a very large number of cores, but these cores are typically utilised less effectively.

FPGAs may comprise memory connected to their chip, with local memory units being assigned to its on-chip arithmetic units. FGPAs usually apply no (or only very simple) caching. In principle, caching can be implemented but it would take up too much of the chip's surface. FPGAs typically comprise a very large number, usually more than a thousand, processing units (e.g. arithmetic units) that together comprise as many as hundreds of thousands of simpler logic execution units. In case an FPGA is utilised effectively, it is characterised by deep pipelines. Reprogramming of FPGAs, that is, the redefinition of the computation architecture implemented on the FPGA, is a relatively slow process compared to the computing performance of the FPGA unit. The redefinition of the computation architecture involves a so-called auto-routing procedure which essentially means designing the computing hardware. During the auto-routing procedure the program designed for the FPGA is converted into physical logic circuit interconnections by the compiler. The reason why this procedure is performed slowly is that it involves finding the optimal (or near-optimal) one of the several interconnection possibilities. Thus, the reprogramming of a conventionally applied FPGA circuit may take as long as half an hour using currently available processors. Due to the large search space significant delays are introduced by the wiring. Thereby, in case of FPGAs the redefinition problem is an exponential time problem, which is usually approximated by a polynomial time algorithm. FPGAs are extremely sensitive to suboptimal routes. By "suboptimal route" it is meant that the data stream route defined for the FPGA has not been optimised to a proper extent. In case of a suboptimal route, processing time may increase significantly.

The memory architecture of FPOA is essentially the same as the memory architecture of FPGA. Compared to FPGAs, FPOAs comprise higher-level processing units, such as ALUs and FPUs, as well as a small number of freely programmable universal logic units.

The so-called systolic array has no predefined memory architecture. The systolic array is a square array where the data inputs are arranged at the left-side and top outermost processing units, while the outputs are arranged at the right-side and bottom processing units. Due to its configuration, a systolic array may not simultaneously run multiple algorithms, only the threads belonging to a single algorithm. The processing units of a systolic array are ALUs or FPUs that may be implemented as multiplier/adder arrays. Elements of the array are interconnected with unidirectional links, and thus no circles (corresponding to loops) can be defined in the array. As a result of the unidirectional links, the direction of the data flow through the array is predefined in hardware, and thereby it cannot be modified. The systolic array can be programmed by defining the order in which data are input in the array, and by specifying the arithmetic instructions sent to the processing units. In a systolic array there is practically no control-flow, only dataflow, i.e. only the routes along which the data flow through the array are defined. The systolic array has characteristically good utilisation of chip area, but is only effective for performing specific, predetermined tasks. There are such known systolic arrays wherein the instructions and the data travel together. The systolic array is not Turing-complete. Turing-completeness of an architecture means that an arbitrary Turing machine may be implemented on the architecture, and therefore, the universal Turing machine may also be implemented thereon, which means that all mathematical algorithms can be implemented.

In addition to the above examples, several other multiprocessor computer architectures (i.e. architectures applying processing unit arrays) are known.

In US 5,828,858 a multiprocessor device is disclosed which is adapted for processing a data stream consisting of a head defining the route (pathway) through the architecture and a main portion comprising data. The architecture disclosed in the document comprises processing units interconnected in an arraylike manner and connected to a so-called crossbar network, adapted for providing access of the outermost processing units, which are arranged along one side of the array. In the device according to the document the data streams propagating through the processing units may have junctions or two data streams may become merged. As detailed in the document, in some variations it is not necessary to provide a global clock signal for the processing units (by applying so-called self- timed data streams). The resulting delays are managed by introducing optionally available bypass routes. The device according to the document comprises special data storage device having synchronisation and scatter/gather functionality. In the document, synchronisation is used to refer to that the data streams propagating towards the same operation wait for each other to arrive. The meaning of gather/scatter functionality is not given in the document. In the architecture according to the document the route of the data is established dynamically, applying so-called stream controllers. Stream controllers are responsible for optimising the utilisation of the processing units, and are also applied for implementing the data streams at run-time on the actual hardware configuration. The architecture disclosed in US 8,127,111 B1 comprises processing units interconnected in an array-like manner. In the architecture according to the document, dynamic and static networks of data streams are defined, which may co-exist in specific cases. In the dynamic network, the data paths are defined dynamically. The "snake-like" data stream that propagates through the device consists of a "head" comprising instructions and a "body" comprising the data. At the end of processing the path covered by the snake-like data stream is emptied by the end portion of the stream to make it available for other data streams. For congestion management, a "backward" message may be sent from a congestion location. To preserve the order of data pieces, the data pieces processed earlier may wait for those data pieces for which processing takes longer. In case the order is lost, a buffer storage with reordering functionality may be applied, and/or the proper order of data may be ensured by the processing units. Certain components of the device, for example the processing units, exchange messages. Some messages may be integrated in the header portion of the data stream. A so- called "memory ordering" procedure is performed in the external memory. "Memory ordering" according to the document means that the order of messages arriving from different components of the device is maintained, or, if necessary, the messages are reordered. In the architecture according to the document global wiring is applied for conveying the clock signal to the processing units. During processing the data stream flows through a plurality of pipeline stages, with the data being passed from one processing unit to another through the pipeline stages in a predetermined number of clock cycles. The architecture scales well, and clock frequency is not limited by effects due to wire lengths. According to US 7,149,875 B2, ordering of the data carried in the data stream is performed utilising a reordering unit connected to the memory corresponding to the processing units, and registers attached thereto. The data carried by the data stream leaving the array of processing units are reordered by these units prior to storing the data stream in the memory. In US 7,568,084 B2 also real-time definition of the data pathway is applied. According to the document the feature that the data elements are processed clock cycle by clock cycle by the pipeline stages located along the route of the data stream is made use of. According to the document this feature may be applied for achieving synchronisation of "signals" arriving through different paths even if no "long-distance" wiring is applied.

In US 5,598,408 components - so-called "crossbars" - adapted for providing that data are properly conveyed to the processing units are applied such that the crossbars surround the array of processing units.

In the solutions disclosed in US 5,898,881 , US 6,173,386 B1 , US 2003/0065904 A1 , US 7,523,292 B2, US 7,734,896 B2, US 7,769,981 B2, US 7,966,481 B2, US 8,058,899 B2, US 8,078,829 B2, US 8,078,839 B2 and US 8,131 ,975 B1 , interconnected processing units arranged in an array are applied.

The documents cited below present solutions for the management of the possibly occurring empty pipeline stages. In US 5,019,967 and US 7,577,827 B2 architectures are disclosed wherein during data processing efforts are made to eliminate empty pipeline stages. In US 7,024,543 B2, US 2007/0237146 A1 and US 7,671 ,627 B1 it is detailed that the congestion of data processing and other parcels may be eliminated in case empty pipeline stages are present. In US 6,993,641 B2 the development of congestions is disclosed. Based on the above referenced documents it is understood that in known architectures and systems efforts are made to eliminate empty pipeline stages in order to generally improve processing speed.

In known multiprocessor architectures the clock signal is typically conveyed to the processing units by means of global wiring. By way of example, in known architectures the so-called H-structure is applied, which provides that the clock signal is conveyed to the processing units in a synchronised manner. In case the H-structure is utilised, clock frequency is limited by the capacitance of the wiring applied for forming the structure. In US 7,015,765 B2 a simple clock distribution wiring structure is disclosed that is different from the conventionally applied H tree structure. In systems comprising the clock distribution wiring according to the document no significant delays of the clock signal may occur due to the configuration of the clock wiring. A system comprising simple clock wiring and providing error-free data transfer is disclosed in US 7,668,272 B1. The following documents are related to the synchronisation of clock signals between specific components of computer architectures. In US 6,698,006 B1 the disadvantages of the H-tree clock distribution structure are described. US 6,711 ,724 B2 and US 6,911 ,854 B2 disclose systems wherein clock signal delay and asynchronous clock signal transfer are taken care of. In US 6,513,149 B1 , US 7,117,472 B2 and US 7,546,567 B2 different clock signal distribution schemes are disclosed. According to US 7,795,943 B2 specific components of a computer architecture are connected to a common clock signal source.

In known multiprocessor computer architectures, finding the optimal data order during the preparation for processing and during processing itself is often problematic. Since data ordering performed by the processing units takes a long time compared to the typical time scale of data processing, the processing units of multiprocessor architectures are extremely sensitive to the order in which they receive the data to be processed. The need for data ordering is sustained during the entire lifetime of a data stream because the data elements of the stream can become out of order (mixed up) - by way of example, due to the presence of junctions and loops -, and a data stream corresponding to an average program may cycle through the processing units and data storage device multiple times during processing. Accordingly, some known computer architectures comprise a device adapted for ordering/sorting the data elements of specific data streams, such device being connected to the data storage device made available for the data streams to be processed by the processing units. A common disadvantage of such multiprocessor architectures is that data ordering/sorting takes up a significant amount of time due to the distance between the device adapted for data ordering and the data storage device, and also due to the separated configuration of these two devices.

Known computer architectures comprising arrays of processing units have the further disadvantage that in most of such architectures the clock signal is conveyed to the individual processing units applying a dedicated wiring. The other part of such architectures is extremely sensitive to congestion of the data streams at the processing units, because in implementations not comprising dedicated clock signal wiring it is required that the data streams are processed in an exceedingly accurate and coordinated manner. To manage the time delays occurring during processing that could deteriorate synchronisation, some of these multiprocessor architectures comprise components adapted for introducing delays in the paths of the data streams. In such devices, therefore, data streams are coordinated - by way of example, in order that the data streams meet at a junction in a synchronised manner - by delaying the data stream that is "ahead of schedule", i.e. synchronization is achieved by slowing down data processing.

DESCRIPTION OF THE INVENTION

The primary object of the invention is to provide a computer architecture which is free of disadvantages of prior art solutions to the greatest possible extent.

An object of the invention is to provide a computer architecture comprising at least one array of processing units which is capable of sorting in order the data elements of the data stream and of reordering the data elements during the processing of the data stream without slowing down the operation of the processing units and with greater efficiency than known architectures.

A further object of the invention is to provide a computer architecture comprising at least one array of processing units wherein it is not required for the operation of the architecture to distribute the clock signal applying a global wiring, and which has higher fault tolerance than known architectures with respect to delays occurring in the data streams to be synchronised, the architecture providing for the synchronisation of data streams with greater efficiency than known architectures.

The objects of the invention can be achieved by the computer architecture according to claim 1 and the method according to claim 17. Preferred embodiments of the invention are defined in the dependent claims. Sorting of the data comprised in the data stream is performed by the computer architecture according to the invention with higher efficiency than in known solutions such that the at least one data storage device thereof, connected to a central processing device comprising at least one array of processing unit, comprises, in addition to a storage unit adapted for storing the data stream, a sorting unit adapted for reordering the data elements comprised in the data array. According to the invention, therefore, the sorting unit is integrated in the individual data storage device. The integration of the sorting unit provides that the reordering of the data carried by the data stream is performed as quickly as possible, and also allows - without the application of any external devices - that the data can be reordered each time the data stream passes through a data storage device.

The computer architecture according to the invention has simpler structure and a smaller search space than FPGA. The inventive computer architecture has simpler interconnections between the processing units, and also more sophisticated processing units are applied compared to FPGA. Thereby the computer architecture according to the invention has better tolerance for non-optimal or suboptimal routes, and thus the data stream routes may be generated much faster. The instruction definition device adapted for route definition may be a simple single-purpose processor, or may be emulated by a conventional processor. According the above, for generating the instruction array the instruction definition device generates a model of the central processing device and applies it for defining the route of the data stream therethrough. Applying the instruction definition device the delays resulting from suboptimal routes may be reduced to a minimum. In case of large quantities of data the delay is negligible.

The computing performance for chip surface unit can be estimated and compared for different multiprocessor architectures. Based on that comparison, the computer architecture according to the invention provides higher computing power for chip surface unit than presently applied GPUs for the following reasons:

• the architecture according to the invention does not comprise cache memory, which may take up as much as 33% of the chip surface in GPUs; · the architecture according to the invention does not have so-called register- file memory. In GPUs the register-file memory takes up a significant surface area since for it to operate effectively a large amount of register-file memory is required - accordingly, cache and register-file memory collectively occupy at least 50% of a GPUs surface area;

· in the architecture according to the invention the interconnections between the processing unit are local, and thus the wiring implementing the interconnections may almost completely overlap with the other layers, i.e. contrary to GPUs it does not take up extra area;

Our calculations indicate - in accordance with what has been put forward above - that the surface utilisation of the architecture according to the invention is significantly more effective than that of a GPU, even with the added data transferring elements and instruction definition device.

The computer architecture according to the invention is suited equally well for 3D rendering, for Ethernet routing, cryptographic algorithms, for managing large databases, for simulations and scientific calculations. The architecture according to the invention is Turing-complete, i.e. any Turing machine can be implemented on it.

The components of the computer architecture according to the invention may be integrated on a single chip. Parameters of the individual components may be modified as required or permitted by advances in technology. BRIEF DESCRIPTION OF THE DRAWINGS

Preferred embodiments of the invention are described below by way of example with reference to the following drawings, where

Fig. 1 is the schematic block diagram of the computer architecture according to the invention,

Fig. 2 illustrates an exemplary operation that may be performed applying the computer architecture according to the invention,

Fig. 3 illustrates how a data stream is processed by the invention,

Fig. 4 illustrates data reordering performed by the data storage device of the invention,

Fig. 5 illustrates, showing individual data elements, the data reordering operation performed by the data storage device of the invention,

Fig. 6 illustrates an embodiment of the central processing device of the computer architecture according to the invention,

Fig. 7 illustrates the interconnections implemented applying interconnection elements in an embodiment of the central processing device between the processing units, the enhanced functionality processing units, and the data transferring elements,

Fig. 8 illustrates an exemplary route of a data stream in case of the embodiment shown in Fig. 6,

Fig. 9 illustrates an embodiment of a processing unit of the central processing device,

Fig. 10A illustrates an embodiment of the operation execution unit of the processing unit,

Fig. 10B shows, also indicating the pipeline stages, the embodiment illustrated in Fig 10B,

Fig. 11 is a schematic block diagram illustrating an embodiment of the data storage device of the computer architecture according to the invention, and Fig 12 illustrates an algorithm that may be performed applying the computer architecture according to the invention. MODES FOR CARRYING OUT THE INVENTION

Fig. 1 shows the schematic block diagram of an embodiment of the computer architecture according to the invention, adapted for processing a computer program. The computer architecture according to the invention comprises a central processing device 10 adapted for processing a data stream which consists of data elements and has an instruction array and a data array, as well as at least one data storage device 12 and an instruction definition device 20 adapted for defining the instruction array, with both the data storage device 12 and the instruction definition device 20 being connected to the central processing device 0. In some embodiments the instruction definition device is integrated in the central processing device 10, but it always constitutes a separate entity within the computer architecture. The central processing device 10 comprises at least one array of processing units and data transferring elements connected to the outermost processing units. The term "outermost processing unit" refers not only to those processing units which are physically located at the edges of the array but also to those through which the data stream enters and leaves the array of processing units. The computer architecture according to the invention comprises an instruction array implementing a computer program on the architecture and defining the traverse route of the data stream. The term "traverse route" is used to refer to the route along which the data stream travels through the units of the architecture, particularly through the central processing device and the at least one data storage device, i.e., in essence, the topology of the computational structure of the task represented by the computer program. Each time a computer program is run, a hardware structure is prepared by the architecture according to the invention, i.e. the traverse routes of the data streams are planned (which implies that the structure may be different from one program run to the next).

The data storage device 12 of the computer architecture according to the invention comprises a storage unit adapted for storing the data stream, and a sorting unit 14 adapted for reordering the data elements of the data stream. This configuration of the data storage device according to the invention, namely, the integration of the storage unit and the sorting unit in the storage device allows for extremely fast sorting of the data elements by the storage device. Accordingly, in the architecture according to the invention a data stream that has unordered data elements when entering the data storage device leaves it in a properly ordered state. By integrating the data sorting functionality in the data storage device the data sorting operation - which influences the efficiency of architectures applying processing unit arrays to a large extent - can be performed extremely efficiently in the architecture according to the invention. The number of the data storage device 12 connected to the central processing device 10 is typically between four and eight, but more data storage device 12 may also be connected. In the data storage device of the computer architecture according to the invention the role of the sorting unit is to prepare the data streams, and, expediently, to control that proper data supply of the central processing device is provided. The source of a data stream is a peripheral, preferably data storage device, and the data stream is terminated also at a peripheral, while it may "flow" (go) through different data storage device with the processing units performing calculations on it. In case the source and destination locations of a data stream are other types of peripherals, then those peripherals usually emulate a data storage device. The task to be performed by the data storage device - that is, reordering the data elements of the data stream, i.e., usually either the sorting of the data elements or the generation of certain data patterns from the data elements - is also stored in the instruction array of the data stream. Data reordering significantly increases the processing speed and efficiency of the architecture according to the invention. Performing a task may in some cases require sending a given data element multiple times. The data stream is defined by forwarding the instruction array to the data storage device, and concatenating the instruction array and the data array in the given data storage device.

In known multiprocessor computer architectures based on data stream processing the data streams leaving the central processing device are rearranged in a proper order applying external devices. This reordering is called "coalescing" in literature. Sorting in order is required in order that the data elements are fed to the processing units in the order which is most appropriate for processing, and that reading and writing of the data storage device is performed through memory transfers using continuous memory areas. In the computer architecture according to the invention coalescing is performed by the data storage device. In a manner different from known architectures, instead of applying external device for ordering the contents of the data storage device, the contents thereof is sorted by a sorting unit, preferably comprising a single-purpose processor, that is integrated in the data storage device itself. Thereby, applying a so-called on-chip processor, data may be sorted orders of magnitude faster than in known devices. Accordingly, the computer architecture according to the invention comprises specialised data storage device, by way of example a specialised DRAM (Dynamic Random Access Memory) unit. Data storage device may be applied particularly expediently for resolving the problem of "local disorder" that occurs most frequently in relation to the architecture according to the invention.

The advantages of data storage device comprising a sorting unit may be exploited more fully the smaller the difference between the speed of the data bus and the data storage device. This is being achieved to an increasing extent by state of the art technology. In a computer architecture the maximum data transfer speed, i.e., the maximum amount of data that can be transferred by the data bus per unit time is determined by the width of the data bus. To achieve the required data processing speed, in the computer architecture according to the invention the data bus is typically much wider than in known architectures.

In the present embodiment the computer program is fed to the computer architecture applying control device 18 connected to the central processing device 10 via appropriate interfaces, such as, by way of example, through a PCI Express (Peripheral Component Interconnect Express) connection. In addition to that, peripherals 16 are also connected to the central processing device 10.

The instruction array is preferably arranged in the header portion of the data stream, the data array being preferably arranged in the central portion thereof. The data array comprises, arranged in a proper order, the data to be processed applying the architecture according to the invention. The header portion comprises the instructions for the data storage device, as well as the arithmetic and control instructions intended for the processing units. The central portion comprises the data elements arranged corresponding to the computational structure. The topology of the arrangement of the data elements essentially corresponds to the structure of the processing of the data stream as performed by the processing units. In some embodiments of the computer architecture according to invention the end portion of the data stream comprises in a route releasing array the instruction related to the unbuild (release) of the previously constructed hardware structure. In other embodiments of the architecture such data streams are applied wherein a given instruction always precedes the data element on which it is executed. Accordingly, in some embodiments the instruction array and the data array may not be separated from each other in the data stream, i.e. the elements of the instruction and data arrays may be arranged in an alternating manner.

The computer architecture according to the invention is adapted for processing data streams. A simple example of data stream processing is shown in Fig. 2, where data piece by data piece addition is performed on input data streams 22', 22" coming from two different sources applying an addition operator 24 implemented in the central processing device 10. As a result of the element-by- element addition, the data stream 26 is obtained from the data streams 22', 22". By way of example, the data streams have a width of 4x32 bits, but this parameter may change depending on the particular implementation. The choice of a data stream width of four units is motivated by 3D applications.

Fig. 3 shows the addition of the first and second channels of the four data channels of an input data stream 25', the result being passed to the first data channel of an output data stream 25". Arriving from the data storage device 2 the data stream 25' is passed through the central processing device 10 wherein it is processed at least partially. As it is shown in Fig. 3, the starting and ending points (and optional intermediate points) of the processing operation are implemented by data storage device 12, while the addition of the data channels is performed by the central processing device 10. The central processing device 10 may also be applied for performing much more complex operations than illustrated above.

The data storage device has a dual role: in addition to storage it also performs data reordering. In Fig. 4 a simple reordering example is presented utilising the data storage device 12. First, an input data stream 27' is stored in a storage unit (not shown) adapted for data stream storage in the data storage device 12, and when data are read out, an output data stream 27" is output by the sorting unit 14 of the data storage device 12. To minimise the processing time required by the data storage device 12, the storage unit and sorting unit of the data storage device 12 are preferably optimised for data movement and comparison. The data storage device 12 is capable of performing, e.g., such simple algorithms as maximum and minimum search, or index generation for data elements.

Fig. 5 illustrates how an input data stream 29' is sorted in order by the data storage device 12. In the sorting operation depicted in the figure, the indices are carried by one of the channels of the data stream 29'. The indices are assigned to the data elements 28 of the data stream 29', and the sorting unit 14 performs sorting of the data elements 28 in order of the indices, i.e. it is adapted for reordering the data elements 28 based on their indices. By sorting the indices in order, an output data stream 29" is generated by the data storage device 12.

Therefore, the data storage device of the architecture according to the invention may e.g. once again sort in order data elements that have become mixed after executing a loop. Data elements may also become mixed in programs without loops. It may also become necessary to reorder the data elements when, although the data elements have not become mixed, but a different ordering is expedient for the forthcoming processing steps to be carried out after the data stream will have left the data storage device. Since loops are fundamentally required for implementing non-trivial algorithms, most programs comprise loops. Only such non-trivial and loop-free programs may be conceived wherein the algorithm comprises the sorting of the data elements, i.e. sorting is required also in case of data streams generated by these types of programs.

Fig. 6 illustrates a detail of the internal structure of the central processing device applied in the computer architecture according to an embodiment of the invention. The central processing device comprises at least one array of interconnected processing units 30, 32 adapted for executing an operation on the data array based on the instruction array, and data transferring elements 34 adapted for transferring the data stream and connected to the outermost processing units 30, 32 of at least one array of processing units 30, 32. The data transferring elements 34 are connected to the outermost processing units 30, 32 preferably on at least two sides - in the present embodiment, on all four sides - of the at least one array of the processing units 30, 32. The more sides of the array has data transferring elements 34 connected, the more different possible routes there are for feeding the data stream arriving through ports 36 into the array, and the more outermost processing units 30, 32 may be utilised for providing entry points for the data stream into the array. The data stream flows through the data transferring elements 34 without being processed, or, in some embodiments wherein multiple data streams are processed in a time overlapped manner, the data streams may be crossed with each other at the data transferring elements 34.

Data stream-based processing significantly improves the capacity utilisation of the individual processing units. The data transferring elements 34 ensures that the data streams enter the array of processing units 30, 32 at proper locations, and thereby the processing units 30, 32 may be optimally assigned to the different data streams processed in a time overlapped manner by the central processing device.

In the embodiment shown in Fig. 6, the central processing device comprises a plurality of arrays of processing units 30, 32, and the arrays of processing units 30, 32 are connected to each other by means of the data transferring elements 34. According to the present embodiment, the data stream may go through more than one arrays of processing units 30, 32 before leaving the central processing device through the ports 36. The data stream may then be directed to data storage device or other peripherals connected to the ports 36, i.e., the central processing device communicates with the components connected thereto via the ports 36. The central processing device is in communication with the other units through a very large number of ports 36. By way of example, in case of a square arrangement of the processing units the number of ports 36 is 4^*VM, where M is the number of processing units.

The data transferring elements 34 are arranged along the edges of the arrays of the processing units 30, 32 as well as between the arrays in order that the data streams may enter the arrays of the processing units 30, 32 at optimal locations. As described above, this may be achieved applying the data transferring elements 34 also by means of crossing the data streams. In case two data streams are crossed with each other, there is no "interaction" between them, i.e. they are crossed with each other without exchanging data.

It is important to note that, since the generation of traverse routes requires relatively few resources and little time, i.e. the delay is tolerable, the processing unit of the computer architecture according to the invention scales well, that is, the "computing surface" may be extended simply by connecting arrays of processing units with data transferring elements and placing them beside each other in a modular manner.

The processing units are generally arranged in arrays of n^*m elements. In the embodiment illustrated in Fig. 6 n=m=4, but n and m may be chosen freely. According to the embodiment shown in Fig. 6 the array of processing units 30, 32 has the processing units 30 and an enhanced functionality processing unit 32 arranged in a square layout, the enhanced functionality processing unit 32 being arranged in one of the corners of the array of square layout. In the figure P denotes the processing units 30, with G denoting the enhanced functionality processing units 32. The enhanced functionality processing units 32 have all the functionality of the processing units 30. In addition to that they may comprise conditional flow control functionality, which renders them particularly capable of implementing certain junction types. The different junction types are described in detail below. Certain junction types may also be implemented applying the processing units 30. The processing units 30, 32 may be different also in the manner their multiplexers are controlled. The multiplexers of the processing units 32 can be controlled at a higher level, for example, it may be specified by the data stream from which direction the multiplexer should receive a data element, or in which direction should it pass it on.

Processing units preferably also comprise programming logic. Programming logic is implemented as a register or registers storing parameters affecting the operation of a given processing unit. Programming logic is not shown in Fig. 9. The registers applied for implementing programming logic are partially in the local memory, and partially comprise electronics performing control operations. Due to the pipeline- type processing, the multiplexers applied in the architecture according to the invention may possess additional functionality compared to a typical multiplexer. The multiplexers essentially comprise LUTs (lookup tables), with the specific bits of the lookup tables indicating the action to be taken in different cases, i.e. it is specified which data elements should be fed to the outputs of the multiplexer. Thus, the LUTs are essentially instruction registers. Thereby, the control flow-type operations are preferably performed by the multiplexers in the processing units applied in the architecture according to the invention.

More than one - in the present embodiment, three - data transferring elements 34 are arranged along each of the outermost processing units 30, 32, with a respective one of the data transferring elements 34 arranged along the outermost processing units 30, 32 being connected to the given outermost processing unit 30.

Fig. 7 shows a detail of Fig. 6 showing interconnection elements 38. As it is shown in the figure, in the present embodiment the interconnection elements 38 are applied for providing passage of the data stream between the processing units 30, 32, between the processing units 30, 32 and the data transferring elements 34, and also between the data transferring elements 34. In the present embodiment four interconnection elements 38 - one on each side of the processing units 30, 32 - are connected to each processing unit 30, 32. However, an arbitrary number of interconnection elements 38 may be included such that further interconnections may be formed between the adjacent processing units 30, 32.

In Fig. 6 three dot denotes are included to indicate that the structure of the central processing device may be extended by the required amount in the indicated directions. Accordingly, in the directions indicated by the dots further arrays of processing units 30, 32 may be arranged, preferably according to the structure illustrated in the figure. The processing unit 30, 32 arrays are surrounded on all sides by data transferring elements 34, and the pattern may be continued accordingly. Preferably, ports 36 are connected on all sides to the outermost data transferring elements 34 of the central processing device, preferably in a configuration shown in the figure, i.e. on the right as well as at the bottom.

Fig. 8 illustrates a traverse route 31 of a data stream belonging to a simple exemplary algorithm. As shown in the drawing, data streams 35', 35" go through the data transferring elements 34 and through some of the processing units 30 which perform calculation operations on the data streams 35', 35". At a junction 33 the data streams 35', 35 merge into a single data stream 40 that leaves the central processing device via a port 36 as shown in the figure. In the figure, the data streams 35', 35", 40 are illustrated by their traverse routes 31.

The application of data streams implies that the program implemented by the instruction array travels together with the data comprised in the data array. Accordingly, the tasks to be performed by the individual processing units are assigned by the respective data streams, which means that the processing units receive the instructions from the data streams. It is therefore not necessary to globally control the processing units: the application of data streams involves that the processing units are controlled exclusively in a local manner. The route to be followed by a data stream is defined by the instruction definition device, i.e. it generates the architectural arrangement required for performing the given calculations. The program is implemented on the architecture by the instruction definition device in a manner known per se. Preferably, the instruction definition device does not have a significant amount of storage, i.e. it may only store the instructions carried by the program, from which instructions it can generate an instruction array that it can pass on, for instance in order to be concatenated with a data array. The instruction definition device is therefore expediently a processing device optimised for performing dedicated tasks, i.e. it is typically not suited for performing general-purpose operations. In the architecture according to the invention the instruction array and the data array may be concatenated in multiple ways. The data stream may arrive from an external peripheral device, not yet comprising an instruction array. Thus, when the stream arrives at a location where no further route has been programmed for it, it will stop. An instruction array implementing the program is then sent to that specific location by the instruction definition device, and the instruction array and the data array are concatenated in the data stream. In this case the instruction array gets to its destination location via the data transferring elements. The instruction array is preferably capable of releasing the route which it has traversed, and thus it makes use of the data transferring elements only for a limited time. Thereby, the instruction definition device generates only the instruction array, which is then sent to that location of the architecture where the data array is residing, and there it is concatenated before or inside the data array. The instruction array may also get to the data storage device where the concatenation of the instruction array and the data array may be performed.

The processing units perform operations on the data array of the data stream as specified by the program implemented in the instruction array. Upon its arrival the data stream is sorted in the order required by the program. Thus, the processing units do not have to wait for data, and - in contrast to certain known architectures - there is also no need for global interconnections in order to generate the architectural arrangement. The number of operations to be performed on the data by the processing units may vary depending on the task to be accomplished. This is to mean that, depending on the workload of the processing units, the path of the data elements propagating along the traverse route of the data stream may be modified dynamically, limited only by the characteristics of the route. Thereby it may happen that, due to the high workload of a given processing unit, some data elements are processed by another processing unit located along the route of the data stream instead of the unit under high load. The traverse route of the data stream - preferably comprising junctions, bypass routes, and closed loops corresponding to program loops - is therefore defined beforehand by the instruction definition device. The trajectories to be followed by each data element traversing a given route may be determined dynamically, expediently as a function of the workload of the processing units along the route.

In addition to what is shown in Fig. 8, the data streams being processed may have further merges and junctions in the central processing device. Two data streams may e.g. merge in case they previously branched off from the same data stream. In case of the merge shown in Fig. 8, the route 31 is planned such that the data streams 35', 35" merge at their stage of processing that is required by the computer program applied for defining the data streams. The computer program may require the implementation of a loop, in which case the route of the data stream is a closed path. A data stream enters at one point of the closed path, with the data processed in the loop leaving with the output data stream connected at another point of the closed path. Loops have the important property that in them data become mixed, and thus the data sorting capability of the data storage device plays an especially important role in case the data stream comprises loops.

Fig. 9 illustrates the internal structure of a processing unit 42 of an embodiment of the invention. The internal structure of the processing units 30, 32 may be identical to the structure of the processing unit 42. The processing unit 42 comprises an input multiplexer 46 adapted for processing inputs 44', 44", local data storage device 47 connected to the input multiplexer 46, an operation execution unit 48 connected to the output of the input multiplexer 46, and an output multiplexer 54, connected to the output of the operation execution unit 48, which is adapted for defining outputs 56', 56" of the processing unit 42. The operation of the multiplexers is known per se, i.e. a certain combination of the inputs thereof is fed to their output.

By way of example, the operation execution unit 48 may be an ALU and/or an FPU. The ALUs or FPUs currently applied - by way of example, in GPUs - may be applied as the operation execution units of the computer architecture according to the invention, although some small modifications may become necessary for their application. In case an FPU is applied as the operation execution unit 48, the processing unit 42 may for instance be utilised for applications requiring 3D rendering, and for performing scientific calculations and simulations. As ALUs are smaller in size than FPUs, ALUs may be more easily integrated in the processing unit 42. Embodiments wherein the processing unit 42 comprises exclusively ALUs may also be conceived. In this case the computer architecture is expediently applied for Ethernet routing, cryptography, or for the management of large-sized databases.

The input multiplexer 46 passes on to its outputs those data elements of the data stream(s) arriving to its input 44', 44" which is processed by the operation execution unit 48 of the given processing unit 42 according to the instruction array of the data stream. According to the corresponding instructions stored in the instruction array the parameters and constants stored in the data storage device 47 may also be applied for controlling the input multiplexer 46 and the output multiplexer 54. The local data storage device 47 is preferably implemented as a read-only memory unit having a size of 128-256 bits. The local data storage device 47 typically stores some constants that are initialised when the header portion of a data stream enters the given processing unit 42. The constants passed on to a processing unit in the header portion of the data stream need not be passed on again (together with the data array) to the given processing unit. In the present embodiment the inputs 44' are arranged at the top side of the processing unit 42, with the inputs 44", the outputs 56' and 56" being arranged, respectively, at the left, bottom, and right sides thereof. Thereby, in the embodiment illustrated in the drawing the inputs 44', 44" of the processing unit 42 are connected, to the processing unit 30, 32, 42 disposed to the left of and above the processing unit 42, or to a data transferring element 34, with the outputs 56', 56" of the processing unit 42 being connected to the processing unit 30, 32, 42 disposed to the right and below the processing unit 42 in question, or to a data transferring element 34. In other embodiments of the invention the inputs and outputs may be arranged in a different manner. In some embodiments inputs and outputs are arranged at each side of the processing unit. In these latter embodiments, two-way data exchange between adjacent processing units is provided for. By arranging the inputs and outputs in different ways, arrays capable of generating special types of routes may be created. For implementing loops, such central processing device are required which are capable of generating routes that comprise closed paths. To form closed paths it is essentially required that the outputs of certain processing unit are arranged differently from the manner illustrated in Fig. 9.

In the embodiment according to Fig. 9 the input and the output of the operation execution unit 48 are interconnected by means of a bypass interconnection 50 and a feedback interconnection 52. If so required, the operation execution unit 48 may be bypassed through the bypass interconnection 50, and the feedback interconnection 52 may be applied to send feedback from the output of the operation execution unit 48 to the input multiplexer 46.

In Fig. 10A an exemplary internal structural arrangement of an FPU performing the function of the operation execution unit 48 is shown. The FPU shown in Fig. 10A comprises a multiplier unit 60, a comparing unit 62 (comparator), an addition unit 64, and another comparing unit 66, which are interconnected in a manner illustrated in the drawing, and generate the outputs from their inputs accordingly.

The computer architecture according to the invention expediently works based on the so-called "pipeline-parallel" principle, and therefore in certain embodiments of the computer architecture pipeline stages are implemented. Data elements are constituted by data (even a single piece of data) that are processed by a single pipeline stage in a single unit of processing time. In certain embodiments of the architecture according to the invention, therefore, the processing units and the data transferring elements comprise pipeline stages that are adapted for defining the stages of a route and comprise a storage element adapted for storing the data elements. The storage elements of the neighbouring pipeline stages are adapted for passing the data elements to each other. During the processing of the data stream the individual data elements are stored in the storage elements of the pipeline stages defining the stages of the route, and, after at least one unit of processing time has elapsed, are passed to the storage element of the pipeline stage defining the next stage of the route. Some aspects of pipeline processing that are crucial for the invention are presented below. ln Fig. 10B the internal structural arrangement of the FPU according to Fig. 10A is shown, illustrating the pipeline stages 68. The number of the pipeline stages 68 implemented in the individual units depends on the chosen technology. The allocation of pipeline stages 68 shown in the drawing is typically applied in case clock frequencies around 1 GHz are utilised. The multiplier unit 60, the addition unit 64, and the comparing unit 66 shown in Fig. 10B comprise several respective pipeline stages 68.

According to a further aspect of the invention, global wiring is not comprised in some embodiments of the inventive computer architecture. Thereby, in some data streams delays may occur during processing. The omission of global wiring also eliminates the limitations on clock frequency imposed by the existence of global wiring. The clock signal of the computer architecture is defined by the longest data element transfer time between neighbouring pipeline stages 68. Data element transfer time encompasses a processing time unit and the time required for transferring the data element between two neighbouring pipeline stages. Such pipeline stages may also exist which require longer processing time (multiple clock cycles). By definition, data elements stay at these pipeline stages for a duration that is longer than one unit of processing time, but this increased retention period does not affect the selection of clock cycle time (i.e. clock frequency is not reduced further). The time for which data elements stay at a pipeline stage is typically identical for different data elements, that is, the architecture has a single characteristic unit of processing time, and thus the clock rate of the architecture according to the invention may be determined easily. However, in case asynchronous behaviour occurs during processing in the architecture, or the transfer time between adjacent pipeline stages is different for different pipeline stages of the architecture, then the clock rate is determined by the longest data element transfer time, i.e. the "longest path" between two adjacent pipeline stages. Thereby, in addition to the reduction of the length of the processing time unit, clock rate can be increased also in case the pipeline stages are distributed densely, such as in the operation execution unit 48 shown in Fig. 10B.

The application of local interconnections between the processing units allows that the clock signal may be fed to the processing units without synchronisation. In some embodiments of the architecture according to the invention local interconnections are applied, and thus the clock signal propagates along the computer architecture as a "local wave". Since in the architecture according to the invention no limitation is imposed on clock frequency by the capacity of global wiring, due to the applicable higher processing frequencies the power consumption of the architecture may be reduced compared with known solutions applying global wiring. If no global clock wiring is included, it is supposed that all data streams travel at the same speed, i.e., by way of example, the data streams arriving at a junction reach the given junction at the same instance. In the absence of global wiring the architecture is sensitive to delays. In the embodiments of the invention wherein no global wiring is included the problem of the sensitivity to delays of the computer architecture is addressed as described below (in the embodiments of the invention applying pipeline-based processing).

By "pipeline-based processing" it is meant that in the components of the computer architecture adapted for data stream processing and data stream transfer, the data streams are passed along pipelines. These components are typically the processing units and data transferring elements of the central processing device, as well as the data storage device. These components comprise pipeline stages: in the processing units typically more pipeline stages are implemented, while in each data transferring element a single pipeline stage is included. Each pipeline stage is adapted for processing and/or storing a single data element. For storing the data element, each pipeline stage comprises a storage element, by way of example a D flip-flop-type storage element. The route of the data stream may be defined by determining the pipeline stages to be successively passed. The pipeline stages to be successively passed are adjacent/neighbouring pipeline stages which may be implemented in either a single component or multiple interconnected components. A pipeline stage may perform a processing operation on the data stream, but its functionality may be also be limited to transferring a data element, such as in case of the pipeline stages of the data transferring elements or data storage device.

In some embodiments of the invention the data stream is constructed such that under normal processing of the data stream the data elements of the data stream are stored in the storage element of every second pipeline stage along the route of the data stream. The term "normal processing" refers to the case wherein the data stream passes unobstructed along its route defined in the computer architecture. The term "half-speed processing" refers to the case wherein every second pipeline stage is filled with a respective data element, i.e. an empty pipeline stage alternates with a pipeline stage filled with a data element.

The application of half-speed processing as normal processing is preferable because thereby the data stream is able to "stop" in case it has to wait because of a loop, a congestion, or for other reasons. In case a data stream is "put on hold", it ceases to be processed under normal processing: data elements located upstream of the congestion may make a step ahead into the next, empty pipeline stage, and thereby the processing of the data stream may continue even in case the route of data stream gets congested or the stream is "put on hold". The "hold" signal expediently sent in such a case starts to propagate backwards along the data stream. In case the route of the data stream comprises junctions, such a "hold" signal causes some of the data elements to choose a route bypassing the congestion at the next junction, and as a result they are processed along the chosen bypass route. Thereby, congestions may be effectively mitigated by introducing junctions and bypass routes. Locations potentially prone to congestion may preferably be identified and taken into account by the instruction definition device, which may define the processing route such that a bypass route is available for congestion mitigation. Thereby, real-time route optimisation for the data elements is essentially implemented. After the congestion has been eliminated, the normal processing regime of the data stream is gradually resumed. The built-in bypass route need not have the same length as the route to be traversed under the normal regime, as mixing of the data elements caused by the length difference may be eliminated effectively and in a virtually "cost-free" manner applying the data storage device of the architecture according to the invention.

Since in a half-speed processing configuration each data element is surrounded by empty pipeline stages, the application of half-speed processing is preferable also because it renders the computer architecture insensitive to delays, allowing local delays even up to 90° without affecting data stream processing. Alternatively, full-speed processing is also conceivable, in which case all pipeline stages are filled by data elements. As described in detail above, full-speed processing is applied in case of congestions, when the pipeline stages originally left empty gradually become filled up. If full-speed processing is applied, however, in case of a congestion caused by a junction or a loop, processing of the data stream is halted, an thereby full-speed processing may only be applied in a problem-free manner for implementing programs involving performing simple arithmetic operations on the data stream without applying loops and other more complex structures. In case half-speed processing is applied a bandwidth loss occurs in comparison with the full-speed case, however - due to the above described advantages, and weighing the difficulties related to full speed processing and the advantages of omitting global wiring - in the architecture according to the invention, half-speed processing is more effective than the application of full-speed processing. In Fig. 11 the schematic block diagram of the data storage device 70 according to an embodiment of the inventive architecture is shown. The data storage device 70 comprises a comparing unit adapted for comparing data elements, and the data storage device 70, the comparing unit, and the sorting unit is controlled by a control unit 74. The control unit 74 is adapted for running the program segment intended for the data storage device 70 that is received by the data storage device 70 in the form of the instruction array. In the present embodiment the data storage device 70 also comprises a data link controller unit 78. The data link controller unit 78 is connected to the data bus through which the data storage device 70 may be connected to a desired port of the central processing device. The data link controller unit 78 prepares the data streams and send them to the central processing device. For performing this function the data link controller unit 78 comprises buffer storage. Data transfer is expediently performed applying fibre- optic data transmission utilising laser transceivers.

In the embodiment shown in Fig. 11 the comparing unit is integrated in a storage unit 72. According to the present embodiment, instead of applying a homogeneous memory array structure, the storage unit 72 has a structure optimised for sorting. The comparing unit is applicable for the data sorting procedure, comparison being one of the most important operations of data sorting. Comparison is performed by the comparing unit parallel with reading out data from the storage unit.

The data storage device 70 further comprises an operation execution unit 76. By way of example, the operation execution unit 76 may be an ALU or an FPU. The operation execution unit 76 has a computing power that is minimally sufficient for performing the sorting operation. In the present embodiment, the sorting unit of the data storage device of the architecture according to the invention is implemented by the comparing unit and the operation execution unit 76. In Fig. 11 the interconnections linking the components of the data storage device 70 are also shown. In a manner applied in known architectures, the sorting operation performed by the data storage device 70 may also be performed by the central processing device, albeit in a significantly less effective manner than it is done by the data storage device according to the invention. In case sorting is performed applying the central processing device, the processing of the data stream is delayed in a manner similar to known architectures, and therefore in the architecture according to the invention the data elements of the data stream are preferably sorted in order applying the data storage device. Reordering of the data element is performed applying the data storage device of the architecture according to the invention such that, in a first step, the input data stream entering the data storage device is stored in the storage unit thereof, with reordering - by way of example, as in case of the example illustrated in Fig. 5, sorting in order - being performed by the sorting unit of the data storage device upon data readout.

Some embodiments of the invention relate to a processing method applying a computer architecture. The processing method according to the invention is adapted for programming a computer architecture comprising a central processing device adapted for processing a data stream consisting of data elements and comprising an instruction array and a data array, and at least one data storage device and instruction definition device connected to the central processing device. The central processing device of the computer architecture comprises at least one array of processing units connected to each other and being adapted for executing an operation on the data array based on the instruction array, and data transferring elements connected to the outermost processing units of the at least one array of processing units and/or to each other via interconnection elements. The data storage device comprises a storage unit adapted for storing the data stream, as well as a sorting unit adapted for reordering the data stored in the data array. In the course of the method the data stream is defined by concatenating the instruction array and the data array, transferring the data stream along a route, it is processed by means of the central processing device, and the instruction definition device is applied for defining an instruction array implementing a computer program on the architecture, and for determining the traverse route of the data stream.

In some embodiments of the processing method according to the invention the processing units and the data transferring elements comprise pipeline stages adapted for defining stages of a route and having a storage element adapted for storing the data elements, the storage elements of the neighbouring pipeline stages being adapted for transferring the data elements to each other, and, in course of processing of the data stream each of the data elements of the data stream being stored in the storage elements of the pipeline stages defining the stages of the route, and, after elapsing of at least one processing time unit, each of the data elements are passed to the storage element of the pipeline stage defining the next stage of the route.

In some embodiments of the method according to the invention in a normal operation of the processing, each of the data elements of the data stream are stored in every second storage element, respectively. In further embodiments of the processing method according to the invention an index is assigned to each of the data elements, and the data elements are reordered based on their indices by means of the sorting unit.

In some embodiments of the processing method according to the invention, in course of the method instruction arrays of a plurality of data streams are defined by means of the instruction definition device, a plurality of data streams are defined by concatenating the respective instruction arrays and data arrays, and the plurality of data streams are processed in a time overlapped manner applying the central processing device.

In some embodiments of the processing method according to the invention in course of processing the data streams the routes of the data streams are crossed with each other on a data transferring element. In other embodiments of the method according to the invention a clock signal of the computer architecture is defined by the longest data element transfer time between neighbouring pipeline stages.

The computer architecture according to the invention is applied for processing a computer program as follows. The computer program is typically comprised in a message arriving from a control device or a computer device comprising the computer architecture. The message comprises the code (control instructions) intended for the data storage device, as well as the arithmetic operations to be performed by the central processing unit. The message is preferably encoded in a graph representation, which is processed by the computer architecture in a manner characteristic of the structure corresponding to the given embodiment thereof. In case the program is encoded in a graph representation the data stream can be mapped easily onto processing units since the flow of the data stream along the processing units can essentially be represented with a graph. The graph representation is a general low-level representation of the program that is not yet mapped topological^ onto the array. The computer program is fed to the instruction definition device, which adds to it the structure which specifies the route of the data stream, i.e., by way of example, the path it should take towards the central processing device, the route along which it should flow in the central processing device, the data storage device it should pass, etc. This structure is dependent on the structure of the computer program, and also on the other programs already running on the computer architecture, that is, the routes of other data streams that implement other programs.

Being implemented in the instruction array, the computer program is sent from the instruction definition device to the storage location of the data to be processed, i.e. typically to a peripheral, e.g. to a data storage device. During the program run, the data stream starts from the peripheral, and, flowing through the central processing device and the data storage device (even multiple times) it arrives in the peripheral or to the data storage device. Multiple data storage device may be arranged along the route of a given data stream, but a given data storage device is typically used by only one program at a given instance. It may happen that a given data storage device is used by more than one programs simultaneously, but this may strongly deteriorate the performance of the computer architecture. It is therefore preferred to apply multiple data storage device in the architecture.

After flowing through the peripheral unit and/or the data storage device, on the one hand the data stream at least partially leaves behind the code intended for the data storage device, and, on the other hand, it "pulls after itself the data. Based on the instruction array the calculation topology is generated as the data stream flows through the central processing device. In the central processing device the calculation operations as well as the parts encoding the calculation topology are stripped from the instruction array. The instruction array expediently ends with an activation instruction, after which comes the data array comprising the data to be processed by the processing units. The data stream may also comprise a junction or junctions, in which case the programming of the data stream system is a more complex task.

From a functional point of view, junctions can be divided into multiple types, which may also be combined:

- Program-junction: this is a junction of the graph representing the computer program, i.e. it is not a data stream junction.

- Calculation junction: supposing that two-operand operations are applied, there can be two cases. In the first case the data stream continues to flow in both directions (it is copied). In the second case the data stream arrives from two different directions, and is processed.

- Conditional junction: two different cases have to be dealt with here as well.

In the first case the data elements are congested in two incoming directions, and a predetermined condition is applied to decide which one is passed on. In a second case two directions are available for sending data, and an arithmetic condition is applied for deciding the outbound direction of each data element. - PHI/Fork junction:

o PHI: the data elements arriving from two different directions are sorted according to their priority, e.g. in an alternating manner, o Fork: data elements may be directed in two different directions according to their priority or being sent in the direction where there is a free space (data elements may be directed in the two directions in an alternating manner in case both routes are available).

Fig. 12 shows an algorithm for calculating the Mandelbrot set implemented by an embodiment of the computer architecture according to the invention. In mathematics, the Mandelbrot set comprises those complex numbers, i.e. it is the set of those points of the complex plane, for which the following recursive complex sequence τ_η+ι := (Xn + c

does not approach infinity. Plotting the Mandelbrot set on the complex plane the well-known Mandelbrot fractal is obtained. The Mandelbrot fractal is plotted on a complex plane, i.e. in a coordinate system with coordinates x and y. The fractal is drawn by performing an iteration using complex arithmetic at each pixel, and stopping the iteration when the absolute value of the given complex number passes a threshold value. The number of iterations will give the colour for the given pixel of the Mandelbrot set. To keep the number of iterations at a finite value, an iteration limit is applied. The conventional C source code of calculating the Mandelbrot set is the following: int mandelbrot ( double x , double y )

{

double z1 = 0 , z2 = 0;

int iter = 0;

double len;

do{

double a = z1 * z1 - z2 * z2 + x;

double b = 2 ^* z1 ^* z2 + y;

z1 = a; z2 = b;

ien = z1 * z1 + z2 * z2 ;

iter++;

}while ( iter < 200 and Ien < 16 );

return iter ;

}

The junctions 80 and operational logic units 82, 88 shown in Fig. 12 by rectangles and squares, respectively, are implemented utilising processing units. Depending on the configuration of the processing components, more than one operational logic units may be implemented applying a single processing unit.

The junctions 80 shown in the figure implement the functionality of the above described PHI-type junction, i.e. the junctions 80 perform data stream merging based on stream priority. To allow for the implementation of loops, feedback data streams (indicated in the drawing by filled triangles) have priority at the junctions 80. Without such prioritisation of the feedback data streams a so-called deadlock would occur, i.e. the data streams would be waiting for each other indiscriminately. Thereby, in case a data stream arrives at a merge junction 80 via the prioritised line indicated by a triangle, then it is made sure that the prioritised data stream is passed on while a data stream potentially arriving at the other line is put on hold.

A corresponding sign shown in the boxes representing the operational logic units 82 in the figure indicates the basic arithmetic operation (addition, subtraction, or multiplication) to be performed by the particular unit. An operational logic unit 88 bearing the sign < performs a comparison, while the operational logic unit 88 labelled "AND" performs a logical AND operation. The constants 84 are stored locally in the respective local data storage device of the given processing units. Junctions 86 indicated by a trapezoid are adapted for implementing conditional junctions. At the junctions 86, the direction in which the data stream or specific data elements thereof will pass on is decided by a logical operation, a so-called loop termination condition. The variables computed by the different junctions are indicated at the inputs and the outputs. The initial values of the variables x and y specifying the pixel coordinates are generated as the actual pixel coordinates by the data storage device, while the initial value of the other variables is zero. The iterated final end results, going to a data storage device, are fed to the outputs. The results are then sorted to place according to coordinates by the data storage device such that the image showing the Mandelbrot fractal may be read out continuously. As no information is utilised further except the pixel coordinates and the iteration number, the outputs that are not required later can be discarded.

The above program adapted for calculating the Mandelbrot fractal demonstrates the mixing of data elements in loops, since, depending on the iteration number, each coordinate pair stays in the loop for a different time period. The coordinates of the pixels to be calculated enter the loop in an ordered manner, but the coordinate data of the pixels for which calculation is finished sooner leave earlier, thereby "overtaking" the pixel coordinate pairs which are calculated more slowly. Since the difference between data element transfer times is limited, the mixing of the data elements remains local. The example illustrated in Fig. 12 also emphasises the pipelined nature of the architecture according to the invention: a large number of data elements go round in the loop in a time overlapped manner, all the operations being carried out parallel. During the calculation of the Mandelbrot fractal each processing unit is placed under load almost continuously, and thus the capacity utilisation of the architecture is very high.

The invention is, of course, not limited to the preferred embodiments described in details above, but further variants, modifications and developments are possible within the scope of protection determined by the claims.

Claims

A computer architecture comprising

- a central processing device (10) adapted for processing a data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) consisting of data elements (28) and comprising an instruction array and a data array, said central processing device (10) comprising

- at least one array of processing units (30, 32, 42) connected to each other and being adapted for executing an operation on the data array based on the instruction array,

- data transferring elements (34) connected to the outermost processing units (30, 32, 42) of the at least one array of the processing units (30, 32, 42) and adapted for transferring the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40), and

- at least one data storage device (12, 70) connected to the central processing device (10),

c h a r a c t e r i s e d in that

- the computer architecture comprises an instruction definition device (20) adapted for defining an instruction array implementing a computer program on the architecture and for determining a traverse route (31) of the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40), said instruction definition device (20) is connected to the central processing device (10), and

- the data storage device (12, 70) comprises

- a storage unit (72) adapted for storing the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40), and

- a sorting unit (14) adapted for reordering the data elements (28).

The architecture according to claim 1 , characterised in that

- the processing units (30, 32, 42) and the data transferring elements (34) comprise pipeline stages (68) having a storage element adapted for storing the data elements (28) and adapted for defining the stages of the route (31), - the storage elements of the neighbouring pipeline stages (68) are adapted for transferring the data elements (28) to each other, and

- in the course of processing of the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) each of the data elements (28) are stored in the storage elements of the pipeline stages (68) defining the stages of the route (31), and, after elapsing of at least one processing time unit, each of the data elements (28) are passed to the storage element of the pipeline stage (68) defining the next stage of the route (31).

3. The architecture according to claim 2, characterised in that in course of processing of the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) in a normal operation of the processing, each of the data elements (28) of the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) are stored in the respective storage element of every second pipeline stage (68).

4. The architecture according to any of claims 1 to 3, characterised in that an index is assigned to each of the data elements (28), and the sorting unit (14) is adapted for reordering the data elements (28) based on their indices.

5. The architecture according to any of claims 1 to 4, characterised in that the route (31) comprises junctions (33) and/or bypass routes.

6. The architecture according to any of claims 1 to 5, characterised in that the processing unit (30, 32, 42) comprises

- an input multiplexer (46) adapted for processing the inputs (44', 44") of the processing unit (30, 32, 42),

- a local data storage device (47) connected to the input multiplexer (46),

- an operation execution unit (48) connected to the output of the input multiplexer (46), and

- an output multiplexer (54) that is connected to the output of the operation execution unit (48) and is adapted for determining the outputs (56', 56") of the processing unit (30, 32, 42).

7. The architecture according to any of claims 1 to 6, characterised in that the data storage device (12, 70) comprises a comparing unit adapted for comparing data elements (28), and the data storage device (12, 70), the comparing unit and the sorting unit (14) are controlled by a control unit (74).

8. The architecture according to any of claims 1 to 7, characterised in that the central processing device (10) is adapted for time overlapped processing of multiple data streams (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40).

9. The architecture according to claim 8, characterised in that in the course of processing of the data streams (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) the routes (31 ) thereof are crossed with each other at a data transferring element (34).

10. The architecture according to any of claims 1 to 9, characterised in that a clock signal of the computer architecture is determined based on the longest data element transfer time between neighbouring pipeline stages (68).

11. The architecture according to any of claims 1 to 10, characterised in that data transferring elements (34) are connected to the outermost processing units (30, 32, 42) on at least two sides of at least one array of the processing units (30, 32, 42).

12. The architecture according to any of claims 1 to 1 1 , characterised in that the central processing device (10) comprises multiple arrays of processing units (30, 32, 42), and the arrays of processing units (30, 32, 42) are connected to each other by means of the data transferring elements (34).

13. The architecture according to any of claims 1 to 12, characterised in that the array of processing units (30, 32, 42) comprises - being arranged in a square layout - the processing units (30, 42) and an enhanced functionality processing unit (32), said enhanced functionality processing unit (32) is arranged in one of the corners of the array arranged in the square layout.

14. The architecture according to any of claims 1 to 13, characterised in that more than one, preferably three data transferring elements (34) are arranged along each of the outermost processing units (30, 32, 42), and one of the data transferring elements (34) out of the data transferring elements (34) arranged along each of the outermost processing units (30, 32, 34) is connected to a given outermost processing unit (30, 32, 42).

15. The architecture according to any of claims 1 to 14, characterised in that the instruction array is arranged in a header portion of the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) and the data array is arranged in a central portion of the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40).

16. The architecture according to claim 15, characterised in that the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) comprises a route releasing array arranged at an end portion thereof.

17. A processing method applying a computer architecture, said computer architecture comprising

- at least one array of processing units (30, 32, 42) connected to each other and being adapted for executing an operation on the data array based on the instruction array, and

- data transferring elements (34) connected to the outermost processing units (30, 32, 42) of the at least one array of processing units (30, 32, 42) and/or connected to each other,

the method comprising the steps of

- defining the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) by concatenating the instruction array and the data array, and - processing the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) by means of the central processing device (10) as the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) is transferred along a route (31),

c h a r a c t e r i s e d in that the architecture comprises an instruction definition device (20) connected to the central processing device (10), and the data storage device (12, 70) comprises a storage unit (72) adapted for storing the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) and a sorting unit (14) adapted for reordering the data stored in the data array; and in the course of the method

- the instruction definition device (20) is applied for defining an instruction array implementing a computer program on the architecture and for determining the traverse route (31) of the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40).

18. The method according to claim 17, characterised in that

- the processing units (30, 32, 42) and the data transferring elements (34) comprise pipeline stages (68) having a storage element adapted for storing the data elements (28) and adapted for defining the stages of a route (31), the storage elements of the neighbouring pipeline stages (68) being adapted for transferring the data elements (28) to each other, and,

19. The method according to claim 18, characterised in that in course of processing of the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) in a normal operation of the processing, each of the data elements (28) of the data stream (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) are stored in every second storage element, respectively.

20. The method according to any of claims 17 to 19, characterised by assigning an index to each of the data elements (28), and the data elements (28) are reordered by the sorting unit (14) based on their indices.

21. The method according to any of claims 17 to 20, characterised in that in the course of the method

- the instruction definition device (20) is applied for defining the instruction arrays of a plurality of data streams (22', 22", 25', 25", 26, 27", 27", 29', 29", 35', 35", 40),

- a plurality of data streams (22', 22", 25', 25", 26, 27', 27", 29", 29", 35', 35", 40) are defined by concatenating the respective instruction arrays and data arrays, and

- the plurality of data streams (22', 22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) are processed in a time overlapped manner by means of the central processing device (10). 22. The method according to claim 21 , characterised in that in the course of processing of the data streams (22',

22", 25', 25", 26, 27', 27", 29', 29", 35', 35", 40) the routes (31) thereof are crossed with each other at a data transferring element (34).

23. The method according to any of claims 17-22, characterised by determining a clock signal of the computer architecture based on the longest data element transfer time between neighbouring pipeline stages (68).