US20100115234A1 - Configurable vector length computer processor - Google Patents

Configurable vector length computer processor

Info

Publication number
US20100115234A1
Authority
US
United States
Prior art keywords
vector
processor
changing
core
change
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/263,302
Inventor
Gregory J. Faanes
Eric P. Lundberg
Abdulla Bataineh
Timothy J. Johnson
Michael Parker
James Robert Kohn
Steven L. Scott
Robert Alverson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cray Inc
Original Assignee
Cray Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2008-10-31
Filing date
2008-10-31
Publication date
2010-05-06
Application filed by Cray Inc
Priority to US12/263,302
Assigned to CRAY INC. reassignment CRAY INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ALVERSON, ROBERT, BATAINEH, ABDULLA, FAANES, GREGORY J., JOHNSON, TIMOTHY J., KOHN, JAMES ROBERT, LUNDBERG, ERIC P., PARKER, MICHAEL, SCOTT, STEVE
Publication of US20100115234A1
Priority to US13/409,033 (US8601236B2)
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053 Vector processors

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)

Abstract

A processor core comprises one or more vector units operable to change between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length. Changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core.

Description

  • FIELD OF THE INVENTION
  • The invention relates generally to vector computer processors, and more specifically in one embodiment to a configurable vector length computer processor.
  • LIMITED COPYRIGHT WAIVER
  • A portion of the disclosure of this patent document contains material to which the claim of copyright protection is made. The copyright owner has no objection to the facsimile reproduction by any person of the patent document or the patent disclosure, as it appears in the U.S. Patent and Trademark Office file or records, but reserves all other rights whatsoever.
  • BACKGROUND
  • Most general purpose computer systems are built around a general-purpose processor, which is typically an integrated circuit operable to perform a wide variety of operations useful for executing a wide variety of software. The processor is able to perform a fixed set of instructions, which collectively are known as the instruction set for the processor. A typical instruction set includes a variety of types of instructions, including arithmetic, logic, and data instructions.
  • In more sophisticated computer systems, multiple processors are used, and one or more processors runs software that is operable to assign tasks to other processors or to split up a task so that it can be worked on by multiple processors at the same time. In such systems, the data being worked on is typically stored in memory that is either centralized, or is split up among the different processors working on a task.
  • Instructions chosen from the instruction set of the computer's processor or processors to perform a certain task form a software program that can be executed on the computer system. Typically, the software program is first written in a high-level language such as “C” that is easier for a programmer to understand than the processor's instruction set, and a program called a compiler converts the high-level language program code to processor-specific instructions.
  • In multiprocessor systems, the programmer or the compiler will usually look for tasks that can be performed in parallel, such as calculations where the data used to perform a first calculation are not dependent on the results of certain other calculations such that the first calculation and other calculations can be performed at the same time. The calculations performed at the same time are said to be performed in parallel, and can result in significantly faster execution of the program. Although some programs such as web browsers and word processors don't consume a high percentage of even a single processor's resources and don't have many operations that can be performed in parallel, other operations such as scientific simulation can often run hundreds or thousands of times faster in computers with thousands of parallel processing nodes available.
  • Multiple operations can also be performed at the same time using one or more vector processors, which perform an operation on multiple data elements at the same time. For example, rather than an instruction that adds two numbers together to produce a third number, a vector instruction may add elements from a 64-element vector to elements from a second 64-element vector to produce a third 64-element vector, where each element of the third vector is the sum of the corresponding elements in the first and second vectors.
  • In this example, the vector registers each hold 64 elements, so the vector length is said to be 64. The vector processor can handle sets of data smaller than 64 by using a vector length register specifying that some number fewer than 64 elements are to be processed, or can handle sets of data larger than 64 elements by using multiple vector operations to process all elements in the data set, such as by using a program loop.
  • The vectors in some further examples do not operate on elements that are sequential in memory, but instead operate on elements that are spaced some distance apart, such as on certain elements of a large array for scientific computing and modeling applications. This distance between elements in a vector is referred to as the stride, such that sequential words from memory have a stride of one, whereas a vector comprising every sixteenth element in memory has a stride of 16.
  • Vector processing provides other benefits to program efficiency, but at the cost of significant load or startup time relative to a scalar operation. Although the vectors must be completely loaded from memory before functions can be performed on the elements, other steps such as checking for variable independence need only be performed once for an entire vector operation. Instruction and coding efficiency are also improved with vector operations, as is memory access where the vector has a known or consistent memory access pattern. Vector processor design choices such as vector length consider these efficiencies and tradeoffs in an attempt to provide both good scalar operation performance and efficient vector operation.
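  • The vector length register, the program loop for data sets longer than the vector length, and the stride described above can be illustrated with a minimal Python sketch. It is a software model only, not the processor's instruction set: the names MAX_VL, vector_add, strided_load, and strip_mined_add are hypothetical and chosen for illustration, and the 64-element limit is taken from the example in the text.

```python
MAX_VL = 64  # maximum vector length used in the example above


def vector_add(a, b, vl):
    """Model one vector add over the first `vl` elements of a and b."""
    if not 0 <= vl <= MAX_VL:
        raise ValueError("vl must be between 0 and MAX_VL")
    return [a[i] + b[i] for i in range(vl)]


def strided_load(memory, base, stride, vl):
    """Load `vl` elements starting at `base`, spaced `stride` words apart."""
    return [memory[base + i * stride] for i in range(vl)]


def strip_mined_add(x, y):
    """Add two equal-length sequences in chunks of at most MAX_VL elements."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    result = []
    for start in range(0, len(x), MAX_VL):
        # The value a vector length register would hold for this pass;
        # it is smaller than MAX_VL only on the final, partial chunk.
        vl = min(MAX_VL, len(x) - start)
        va = strided_load(x, start, 1, vl)  # stride of one: sequential words
        vb = strided_load(y, start, 1, vl)
        result.extend(vector_add(va, vb, vl))
    return result
```

  • In this sketch a 150-element data set is processed in three passes of 64, 64, and 22 elements, and strided_load(memory, 0, 16, 4) selects every sixteenth word starting at word zero.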
  • SUMMARY
  • Some embodiments of the invention comprise a processor core that comprises one or more vector units operable to change between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length. Changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core.
  • BRIEF DESCRIPTION OF THE FIGURES
  • FIG. 1 shows a reconfigurable vector space supporting four streams and a vector length of 16, consistent with an example embodiment of the invention.
  • FIG. 2 shows a reconfigurable vector space supporting 32 streams and a vector length of one, consistent with an example embodiment of the invention.
  • FIG. 3 shows a vector processor having configurable vector modes, consistent with an example embodiment of the invention.
  • DETAILED DESCRIPTION
  • In the following detailed description of example embodiments of the invention, reference is made to specific examples by way of drawings and illustrations. These examples are described in sufficient detail to enable those skilled in the art to practice the invention, and serve to illustrate how the invention may be applied to various purposes or applications. Other embodiments of the invention exist and are within the scope of the invention, and logical, mechanical, electrical, and other changes may be made without departing from the scope or subject of the present invention. Features or limitations of various embodiments of the invention described herein, however essential to the example embodiments in which they are incorporated, do not limit the invention as a whole, and any reference to the invention, its elements, operation, and application do not limit the invention as a whole but serve only to define these example embodiments. The following detailed description does not, therefore, limit the scope of the invention, which is defined only by the appended claims.
  • Vector processor architectures often include vector registers having a fixed number of entries, each vector register capable of holding a single vector. Vector functional units, such as an add/subtract unit, a multiply unit and a divide unit, and logic operation units are either dedicated to serving vector operations or are shared with scalar operations. Scalar registers are also used in some vector operations, such as where every element of a vector is multiplied by a scalar number. An example processor might have, for example, eight vector registers with 64 elements per register, where each element is a 64-bit word.
  • It is desirable in some applications to have vector lengths that are longer, while in other applications greater performance could be achieved if vector lengths were shorter or if the processor functioned more like a scalar processor. One embodiment of the invention seeks to address problems such as this by providing a reconfigurable processor core, such as where a more vectorized and a less vectorized configuration are available within the same processor core and can be selected to improve application execution efficiency.
  • In one such example, a processor chip contains 32 cores, where each core is capable of operating in either a vector threaded mode supporting four streams having a maximum vector length of 16, or a scalar threaded mode supporting 32 streams of a maximum vector length of one. Each mode has the same instruction set architecture, same instruction issue rate, and same instruction processing performance, but will provide different application performance based on the parallelization or vectorization that can be achieved for a given application.
  • In one such example illustrated in FIGS. 1 and 2, the vector registers and address registers allocated to different numbers of instruction streams are shown, demonstrating how an example register space is configured to facilitate changing vector modes. In this example, 3,072 register elements are organized as 96 registers with 32 elements each. FIG. 1 shows the example register space configured to support four streams having a maximum vector length of 16, whereas FIG. 2 illustrates the same register space configured to support 32 streams with a maximum vector length of one.
  • Vector registers allocated to each of four different instruction streams of the four-stream 16-element vector configuration are shown at 101, each stream being allocated 32 registers having 16 elements each, such that there is a maximum vector length of 16. Address registers for each stream are allocated in register space 102, but only consume two elements of 32 registers per stream—the remaining register space that is crossed out is unused in this vector mode.
  • In FIG. 2, the same vector register space is configured such that each of 32 streams is allocated vector register space having 32 elements each, for a total of 1024 registers. The remaining 2048 registers are allocated as address registers as shown at 202, such that each of the 32 streams is allocated 64 address registers.
  • In this example embodiment, the address registers and vector registers are a part of a processor core, as shown in FIG. 3. Here, the XPipe element, shown at 301, is the execution pipeline and includes the address register/vector register space of FIGS. 1 and 2, shown at 302 in FIG. 3. The MPipe, or memory pipeline, which includes the load/store unit of the processor, is shown at 303, and the IPipe, or instruction pipeline, is shown at 304. The instruction pipeline includes the instruction buffers and cache, and the instruction fetch and issue logic.
  • To change modes between fine-grained parallel applications that benefit from running in a 32-stream mode and coarse-grained parallel applications that benefit from the longer vector length of the 4-stream mode, the processor core quiets all executing threads in the core being reconfigured, and flushes the registers. The registers and instruction pipelines are reloaded under the new vector/stream mode, and execution is restarted.
  • Changing modes therefore involves repartitioning the register space and reassignment of registers to different streams, or between vector and address register allocation, depending on the embodiment being practiced. The actual register space remains the same, as is illustrated in the example of FIGS. 1 and 2, and the IPipe system remains the same but switches between four and 32 instruction streams based on the selected mode. A variety of other necessary or optional changes, such as changing a maxVL or maximum vector length register to reflect the new configured maximum vector length, are also employed in some embodiments, and are within the scope of the invention.
  • The processor of this example can therefore be configured for fine-grained or coarse-grained parallelism on the fly, even within an executing application. The ability to configure the processor core on the fly, even within a job or application, provides greater flexibility and efficiency in execution than prior systems could provide. Further, the ability to switch modes on a core-by-core basis rather than on a system-by-system basis or chip-by-chip basis enables configuration of individual cores to best suit the applications assigned to those specific cores. For example, a processor chip containing 32 cores can configure 28 cores to work on a coarse-grained parallel application using a vector length of 16, while the remaining four cores execute fine-grained threads that do not lend themselves to vector parallelization as well.
  • Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement which is calculated to achieve the same purpose may be substituted for the specific embodiments shown. This application is intended to cover any adaptations or variations of the example embodiments of the invention described herein. It is intended that this invention be limited only by the claims, and the full scope of equivalents thereof.
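  • The register partitioning of FIGS. 1 and 2 and the mode-change sequence described above can be modeled with a minimal Python sketch. It is an illustration under stated assumptions rather than the hardware design: the names VectorMode, Core, COARSE, and FINE are hypothetical, the register space is counted in elements (96 registers of 32 elements), and "flushing" is modeled simply as clearing the register contents, which the description does not specify.

```python
from dataclasses import dataclass

TOTAL_ELEMENTS = 96 * 32  # 96 registers x 32 elements = 3,072, per the example


@dataclass(frozen=True)
class VectorMode:
    name: str
    streams: int
    max_vl: int
    vector_elements_per_stream: int
    address_elements_per_stream: int

    def used(self):
        """Register elements allocated across all streams in this mode."""
        per_stream = (self.vector_elements_per_stream
                      + self.address_elements_per_stream)
        return self.streams * per_stream

    def unused(self):
        """Register elements left unallocated in this mode."""
        return TOTAL_ELEMENTS - self.used()


# FIG. 1: four streams, each with 32 vector registers of 16 elements, plus
# address registers that consume two elements of 32 registers.
COARSE = VectorMode("coarse", streams=4, max_vl=16,
                    vector_elements_per_stream=32 * 16,
                    address_elements_per_stream=32 * 2)

# FIG. 2: 32 streams, each with 32 one-element vector registers and
# 64 address registers.
FINE = VectorMode("fine", streams=32, max_vl=1,
                  vector_elements_per_stream=32,
                  address_elements_per_stream=64)


class Core:
    """Toy model of one core; not a description of the actual hardware."""

    def __init__(self, mode):
        self.mode = mode
        self.max_vl = mode.max_vl
        self.running = True
        self.registers = [0] * TOTAL_ELEMENTS
        self.log = []

    def change_mode(self, new_mode):
        # 1. Halt all instruction stream execution in the core.
        self.running = False
        self.log.append("halt")
        # 2. Flush the registers (assumption: modeled as clearing contents).
        self.registers = [0] * TOTAL_ELEMENTS
        self.log.append("flush")
        # 3. Repartition the same register space and update maxVL.
        self.mode = new_mode
        self.max_vl = new_mode.max_vl
        self.log.append("reconfigure")
        # 4. Restart instruction execution under the new mode.
        self.running = True
        self.log.append("restart")
```

  • Under this accounting the four-stream mode allocates 2,304 of the 3,072 elements and leaves 768 unused, while the 32-stream mode allocates all 3,072. Because each Core object holds its own mode, a list of 32 such objects can place 28 cores in the coarse-grained mode and four in the fine-grained mode, mirroring the core-by-core example above.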

Claims (20)

1. A processor core, comprising:
one or more vector units operable to change between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length.
2. The processor core of claim 1, wherein the one or more vector units is further operable to change the number of instruction streams.
3. The processor core of claim 1, wherein the one or more vector units are operable to change the maximum vector length by changing vector register allocation in a register space.
4. The processor core of claim 1, wherein the one or more vector units are operable to change the maximum vector length by changing instruction issue mode.
5. The processor core of claim 1, wherein a processor comprises multiple processor cores each having one or more vector units, and the multiple processor cores are independently operable to change vector modes.
6. The processor core of claim 1, wherein changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core.
7. The processor core of claim 1, wherein changing vector modes comprises using the same instruction set architecture in different vector modes.
8. A multiprocessor computer system, comprising:
a plurality of processing nodes, each node comprising one or more local processor cores, wherein the one or more local processor cores each comprise one or more vector units operable to change between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length.
9. The multiprocessor computer system of claim 8, wherein the one or more vector units is further operable to change the number of instruction streams.
10. The multiprocessor computer system of claim 8, wherein the one or more vector units are operable to change the maximum vector length by changing vector register allocation in a register space.
11. The multiprocessor computer system of claim 8, wherein the one or more vector units are operable to change the maximum vector length by changing instruction issue mode.
12. The multiprocessor computer system of claim 8, wherein changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core.
13. The multiprocessor computer system of claim 8, wherein changing vector modes comprises using the same instruction set architecture in different vector modes.
14. The multiprocessor computer system of claim 8, wherein the one or more local processor cores are operable to independently change vector modes.
15. A method of operating a vector computer processor, comprising:
changing between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length.
16. The method of operating a vector computer processor of claim 15, wherein the change in vector mode is initiated by one or more of an application, an operating system, a batch system, or a processor core.
17. The method of operating a vector computer processor of claim 15, wherein changing vector modes comprises at least one of changing the number of instruction streams, changing vector register allocation in a register space, changing instruction issue mode.
18. The method of operating a vector computer processor of claim 15, wherein changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core.
19. The method of operating a vector computer processor of claim 15, wherein the vector computer processor comprises multiple processor cores each having one or more vector units, and the multiple processor cores are independently operable to change vector modes.
20. The method of operating a vector computer processor of claim 15, wherein changing vector modes comprises using the same instruction set architecture in different vector modes.
US12/263,302 2008-10-31 2008-10-31 Configurable vector length computer processor Abandoned US20100115234A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US12/263,302 US20100115234A1 (en) 2008-10-31 2008-10-31 Configurable vector length computer processor
US13/409,033 US8601236B2 (en) 2008-10-31 2012-02-29 Configurable vector length computer processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US12/263,302 US20100115234A1 (en) 2008-10-31 2008-10-31 Configurable vector length computer processor

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/409,033 Continuation US8601236B2 (en) 2008-10-31 2012-02-29 Configurable vector length computer processor

Publications (1)

Publication Number Publication Date
US20100115234A1 (en) 2010-05-06

Family

ID=42132906

Family Applications (2)

Application Number Title Priority Date Filing Date
US12/263,302 Abandoned US20100115234A1 (en) 2008-10-31 2008-10-31 Configurable vector length computer processor
US13/409,033 Active US8601236B2 (en) 2008-10-31 2012-02-29 Configurable vector length computer processor

Family Applications After (1)

Application Number Title Priority Date Filing Date
US13/409,033 Active US8601236B2 (en) 2008-10-31 2012-02-29 Configurable vector length computer processor

Country Status (1)

Country Link
US (2) US20100115234A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107688466A (en) * 2016-08-05 2018-02-13 北京中科寒武纪科技有限公司 A kind of arithmetic unit and its operating method

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
US11544214B2 (en) * 2015-02-02 2023-01-03 Optimum Semiconductor Technologies, Inc. Monolithic vector processor configured to operate on variable length vectors using a vector length register
US11416261B2 (en) * 2019-08-08 2022-08-16 Blaize, Inc. Group load register of a graph streaming processor
US11307860B1 (en) 2019-11-22 2022-04-19 Blaize, Inc. Iterating group sum of multiple accumulate operations

Citations (1)

Publication number Priority date Publication date Assignee Title
US7809925B2 (en) * 2007-12-07 2010-10-05 International Business Machines Corporation Processing unit incorporating vectorizable execution unit

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US6295597B1 (en) * 1998-08-11 2001-09-25 Cray, Inc. Apparatus and method for improved vector processing to support extended-length integer arithmetic
US7581037B2 (en) * 2005-03-15 2009-08-25 Intel Corporation Effecting a processor operating mode change to execute device code
US7492368B1 (en) * 2006-01-24 2009-02-17 Nvidia Corporation Apparatus, system, and method for coalescing parallel memory requests

Also Published As

Publication number Publication date
US20120221830A1 (en) 2012-08-30
US8601236B2 (en) 2013-12-03

Legal Events

Date Code Title Description
AS Assignment

Owner name: CRAY INC., WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FAANES, GREGORY J.;LUNDBERG, ERIC P.;BATAINEH, ABDULLA;AND OTHERS;REEL/FRAME:022486/0972

Effective date: 20090324

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION