WO2016114771A1 - Reduced core count system configuration


Info

Publication number
WO2016114771A1
Authority
WO
WIPO (PCT)
Prior art keywords
reduced core count configuration
Prior art date
Application number
PCT/US2015/011358
Other languages
French (fr)
Inventor
Michael B. Calhoun
Original Assignee
Hewlett Packard Enterprise Development Lp
Priority date
Filing date
Publication date
Application filed by Hewlett Packard Enterprise Development Lp filed Critical Hewlett Packard Enterprise Development Lp
Priority to PCT/US2015/011358 priority Critical patent/WO2016114771A1/en
Publication of WO2016114771A1 publication Critical patent/WO2016114771A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 1/00: Details not covered by groups G06F 3/00-G06F 13/00 and G06F 21/00
    • G06F 1/26: Power supply means, e.g. regulation thereof
    • G06F 1/32: Means for saving power
    • G06F 1/3203: Power management, i.e. event-based initiation of a power-saving mode
    • G06F 1/3234: Power saving characterised by the action undertaken
    • G06F 1/3296: Power saving by lowering the supply or operating voltage
    • G06F 1/324: Power saving by lowering clock frequency
    • G06F 1/3243: Power saving in microcontroller unit
    • G06F 1/329: Power saving by task scheduling
    • G06F 9/5094: Allocation of resources, e.g. of the central processing unit [CPU], where the allocation takes into account power or heat criteria
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • Figure 1 illustrates an example multiple core processor 100 that can be implemented in an SMP system to concurrently execute threads or processes.
  • Multiple core processor 100 may be a central processing unit (CPU). This example includes multiple cores implemented on a single die 102.
  • the multiple core processor 100 is configured as an integrated circuit chip to fit into a socket that provides mechanical and electrical connections with a printed circuit board such as a motherboard on a computing device such as a server.
  • Multiple core processor 100 includes four physical processor cores 104, 106, 108, 110, or simply four physical cores, where each of the physical cores is available to process at least one application thread concurrently with at least one other physical core processing another thread.
  • The physical cores 104, 106, 108, 110 are adjacent to a memory controller 112 and a cache 114 on the die 102.
  • Each of the cores is associated with a cache hierarchy.
  • The architecture of processor 100 includes cache in the physical cores 104, 106, 108, 110 (such as L1 and L2 cache), an L3 cache in cache 114, memory served by memory controller 112, and so on.
  • Caches L1, L2, and L3 in this example can represent on-die memory because they are located on the die 102, whereas the memory hierarchy can further extend to off-die memory.
  • Each location in memory includes a cache line, which can vary in length depending on the processor and memory used.
  • A queue 116 is disposed on the die between memory controller 112 and cache 114.
  • The die 102 can include other features 118 or a combination of features such as memory interfaces, miscellaneous input/output blocks, proprietary interconnects, expansion card interfaces, and the like.
  • Each physical core may be capable of efficiently and concurrently executing multiple threads of a concurrent process. Such physical cores are often referred to as "Simultaneous Multi-Threading," or simply "SMT," cores, and the concurrently executed threads on each physical core share hardware resources included within the single physical core.
  • Each physical core is capable of multithreading. Each physical core capable of multithreading can present the operating system with as many logical cores as concurrently executing threads it supports.
  • Each physical core 104, 106, 108, 110 is capable of concurrently executing two threads, and thus provides the operating system with eight concurrent logical cores.
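The arithmetic in the bullets above can be sketched directly; this is a minimal illustration using the example figures from the text (four physical cores, each running two threads):

```python
# Logical core count presented to the operating system: physical cores
# times the number of threads each SMT core can execute concurrently.
physical_cores = 4
smt_threads_per_core = 2

logical_cores = physical_cores * smt_threads_per_core
print(logical_cores)  # 8 concurrent logical cores, as in the example
```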
  • The multiple core processor 100 can be included as part of multiple processor architectures depending upon performance considerations, such as non-uniform memory access (NUMA) and symmetric multiprocessing (SMP) architectures.
  • the memory access time depends on the memory location relative to the processor, or memory distance, so a processor core can access its own local memory faster than non-local memory such as memory local to another processor or memory shared between processors.
  • NUMA can be used as a tightly coupled form of cluster computing and virtual memory paging added to a cluster architecture can allow the implementation of NUMA entirely in software.
  • Multiple core processors in multi-socket servers are typically homogeneous, but the use of heterogeneous multiple core processors is contemplated.
  • processor architectures can further be combined with other multiple processor architectures in distributed systems.
  • the myriad of available or later developed combinations of logical cores, physical cores, processors, and multiple processor systems that can be used to implement the mechanism is not limited to any particular processing system or architecture.
  • each unit capable of concurrently executing a thread or component is generally described here as a "core.”
  • A graphics processing unit (GPU) can also be an example of a multiple core processor.
  • Multiple core processors can include more than two or four cores, and processor 100 is presented as an example. Accordingly, systems or processors including more cores are contemplated.
  • a central processing unit available from Intel Corporation and sold under the trade designation Xeon Processor E7-2890 v2 having a 37.5 megabyte cache and a clock speed of 2.8 gigahertz includes fifteen physical cores and uses approximately 155 Watts of power. Each physical core is capable of simultaneously multithreading two threads and thus provides the operating system with thirty logical cores.
  • processors include a mode for running at a higher clock speed, such as a boost mode that may increase clock speed by 33% or more, but boost mode can be run for relatively short periods of time due to concerns of overheating and the lack of efficient cooling of the die.
  • Memory-level parallelism is a term in computer architecture that refers to the ability to have multiple memory operations pending at the same time.
  • One technique for exploiting memory-level parallelism is aggressive out-of-order execution, which uses large instruction window resources, i.e., the reorder buffer, the issue queue, and the load/store queue. Simply enlarging these resources degrades the clock cycle time in multi-processing systems, although other memory-level parallelism schemes are directed to addressing these issues. Examples of types of computer applications that benefit from efficient memory-level parallelism include large databases, database management services, online transaction processing, and online analytical processing, among others.
  • the total solution cost for a particular computing application is a combination of hardware costs, software costs, power costs, and cooling costs.
  • the primary driver of the total solution cost is typically the software costs (i.e., operating system (OS), database, application, middleware, etc.).
  • the software components are typically priced based primarily on CPU system scale (i.e., physical core count), but their performance is more tightly linked to available physical memory. Adding a marginal core slightly improves the performance, but greatly increases the total solution cost due to software licensing.
  • the total solution cost for an application can be reduced by configuring a system using firmware to maximize the per-core performance. By reducing the number of cores used to implement an application, software licensing costs and therefore total solution cost can be substantially reduced while maintaining sufficient performance.
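As a rough sketch of that cost structure (all dollar figures below are invented for illustration, not taken from the patent), per-core software licensing dominating the total makes a marginal core far more expensive than its performance contribution:

```python
def total_solution_cost(active_cores,
                        hw_cost=20_000.0,             # fixed hardware cost (assumed)
                        sw_license_per_core=5_000.0,  # per-core licensing (assumed)
                        power_cooling_per_core=150.0):
    """Total solution cost = hardware + software + power + cooling,
    with the software term scaling on physical core count."""
    return (hw_cost
            + sw_license_per_core * active_cores
            + power_cooling_per_core * active_cores)

# Each added core raises cost by the full per-core license fee, so a
# reduced core count configuration can cut cost far faster than it
# cuts performance.
for k in (4, 8, 12):
    print(k, total_solution_cost(k))
```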
  • FIG. 2 illustrates one example of a system 200 for selecting and configuring a reduced core count configuration for the system.
  • System 200 includes an inter-domain fabric 202, a plurality of domains 204a-204c, system resources 214, and an analysis engine 224.
  • Each domain 204a-204c includes a CPU 206a-206c, a memory 208a-208c, voltage and frequency (V/F) control 210a-210c, and system firmware 212a-212c, respectively.
  • System firmware 212a-212c is communicatively coupled to CPU 206a-206c and memory 208a-208c through a communication path 209a-209c and to V/F control 210a-210c through a communication path 211a-211c, respectively.
  • Each CPU 206a-206c is communicatively coupled to inter-domain fabric 202 through a communication path 203a-203c, respectively.
  • System firmware 212a-212c is communicatively coupled to analysis engine 224 through communication path 213.
  • Analysis engine 224 includes system configuration and status 226, a system efficiency table 228, and a system resources list 230.
  • System resources 214 include application performance data 216, power data 218, operating system data 220, and licensing data 222.
  • Application performance data 216 is exchanged with analysis engine 224 through a communication path 217.
  • Power data 218 is exchanged with analysis engine 224 through a communication path 219.
  • Operating system data 220 is exchanged with analysis engine 224 through a communication path 221.
  • Licensing data 222 is exchanged with analysis engine 224 through a communication path 223.
  • inter-domain fabric 202 and domains 204a-204c provide a server. While three domains are illustrated in Figure 2, the disclosure is applicable to any suitable number of domains such as a single domain system or a system having more than three domains.
  • Each CPU 206a-206c may include one or more processor cores, such as four cores as in multiple core processor 100 previously described and illustrated with reference to Figure 1.
  • Each memory 208a-208c includes any suitable combination of volatile and/or non-volatile memory, such as combinations of Random Access Memory (RAM), Read-Only Memory (ROM), flash memory, and/or other suitable memory.
  • Each V/F control 210a-210c controls the operating voltages (e.g., core voltage, memory voltage) and clock frequencies (e.g., core clock frequency, memory clock frequency) for each domain 204a-204c, respectively.
  • System firmware 212a-212c includes commands for configuring each domain 204a-204c including setting the operating voltages and clock frequencies and the number of active cores for each CPU 206a-206c, respectively.
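A hypothetical sketch of the per-domain settings such firmware commands would apply (the field names and values are assumptions for illustration, not the patent's interface):

```python
from dataclasses import dataclass

@dataclass
class DomainConfig:
    active_cores: int   # number of CPU cores left enabled in the domain
    core_mhz: int       # processor clock frequency, MHz
    core_mv: int        # core voltage, millivolts
    mem_mhz: int        # memory clock frequency, MHz
    mem_mv: int         # memory voltage, millivolts

# One reduced core count configuration for a three-domain system
# (domains 204a-204c), totalling four active cores:
domains = [
    DomainConfig(active_cores=2, core_mhz=3400, core_mv=1100, mem_mhz=2133, mem_mv=1200),
    DomainConfig(active_cores=1, core_mhz=3600, core_mv=1150, mem_mhz=1866, mem_mv=1150),
    DomainConfig(active_cores=1, core_mhz=3600, core_mv=1150, mem_mhz=1866, mem_mv=1150),
]
total_active = sum(d.active_cores for d in domains)
```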
  • System resources 214 contains the data used for configuring system 200.
  • Application performance data 216, power data 218, operating system data 220, and licensing data 222 vary depending on the number of active cores.
  • Analysis engine 224 determines the configuration for system 200 based on the system resources data for each reduced core count configuration.
  • System configuration and status 226 indicates the current system 200 configuration and status (i.e., active reduced core count configuration).
  • System efficiency table 228 maintains data collected on the system performance as it relates to core count.
  • System resources list 230 maintains a list of viable reduced core count configurations and their corresponding components available for system 200.
  • Analysis engine 224 can modify system firmware 212a-212c to configure each domain 204a-204c, respectively, for testing and deployment. For each reduced core count configuration, analysis engine 224 determines the optimum processor clock frequency and the optimum core voltage for each CPU 206a-206c. For example, for four core CPUs 206a-206c, one reduced core count configuration may include one active core of each CPU 206a-206c such that the reduced core count configuration includes three active cores. In another example, another reduced core count configuration may include two active cores of CPU 206a and one active core of each of CPUs 206b and 206c such that the reduced core count configuration includes four active cores. In this example where each CPU 206a-206c includes four cores, a reduced core count configuration is any configuration less than 12 cores. The optimum processor clock frequencies and optimum core voltages for each reduced core count configuration may be maintained by system efficiency table 228.
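The enumeration described above can be sketched as follows, assuming (as in the examples) that at least one core per CPU stays active:

```python
from itertools import product

CORES_PER_CPU = 4   # four-core CPUs 206a-206c
NUM_CPUS = 3        # three domains

# A candidate configuration is the number of active cores chosen per CPU;
# any total strictly below the full 12 cores is a reduced configuration.
configs = [
    cfg for cfg in product(range(1, CORES_PER_CPU + 1), repeat=NUM_CPUS)
    if sum(cfg) < CORES_PER_CPU * NUM_CPUS
]

# (1, 1, 1) is the three-active-core example; (2, 1, 1) the four-core one.
print(len(configs))  # 63 of the 64 combinations leave at least one core off
```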
  • Analysis engine 224 measures the performance of system 200 as a function of active processor cores while executing a selected application to be optimized.
  • the measured performance data, power data, operating system data, and licensing data for each reduced core count configuration may be maintained by system resources 214.
  • viable reduced core count configurations are determined and may be maintained by system resources list 230.
  • A viable reduced core count configuration may be selected and domains 204a-204c may be configured for deployment to execute the selected application.
  • Domains 204a-204c are configured by updating system firmware 212a-212c, respectively.
  • Figure 3 is a flow diagram illustrating one example of a method 300 for selecting a reduced core count configuration for a system, such as system 200 previously described and illustrated with reference to Figure 2.
  • An optimum processor clock frequency and an optimum core voltage are determined for each reduced core count configuration. The optimum processor clock frequency and the optimum core voltage for each reduced core count configuration may be determined by lab testing. As the number of active cores decreases, the optimum core voltage and the optimum clock frequency increase. Accordingly, as the number of active cores decreases, the per-core performance increases.
  • The optimum memory clock frequency and the optimum memory voltage for each reduced core count configuration are determined. As the number of active cores decreases, the demand on memory decreases, particularly with increased per-core cache available. Accordingly, as the number of active cores decreases, the memory clock frequency and the memory voltage may be decreased to reduce power costs.
  • The cooling cost for each reduced core count configuration may increase as the number of active cores decreases due to increased core clock frequencies and core voltages.
  • a reduced core count configuration is selected from the plurality of reduced core count configurations for an application based on a performance and a cost of each reduced core count configuration while executing the application.
  • the performance of the system is measured while running the application to be optimized using each reduced core count configuration.
  • the cost of input power to the system and cooling for the system is tracked for each reduced core count configuration.
  • The performance of the system as a function of active cores (k) can be modeled by a function Perf(k) that is monotonic in the useable range with zero intercept and a slope strictly less than one.
  • the performance of the system for each reduced core count configuration is determined by measuring transactions per minute executed by the system for each reduced core count configuration.
  • The cost of the system as a function of active cores is the sum of the hardware (HW) costs, software (SW) costs, power costs, and cooling costs.
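The extraction drops the patent's display equations; the following is a plausible reconstruction from the surrounding text (an assumption, not the patent's exact formulas), using the stated properties of Perf(k) and the four named cost components:

```latex
% Performance vs. active cores k: monotonic in the useable range,
% zero intercept, slope strictly less than one (diminishing returns):
\mathrm{Perf}(k) \approx s \cdot k, \qquad 0 < s < 1

% Total cost as the sum of the four components:
\mathrm{Cost}(k) = \mathrm{HW}(k) + \mathrm{SW}(k) + \mathrm{Power}(k) + \mathrm{Cooling}(k)
```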
  • the system may then be configured to execute the application using the selected reduced core count configuration.
  • Figure 4 is a flow diagram illustrating another example of a method 310 for selecting a reduced core count configuration for a system, such as system 200 previously described and illustrated with reference to Figure 2.
  • an optimum processor clock frequency and an optimum core voltage are determined for each of a plurality of reduced core count configurations for a system configurable to use up to a set number of cores.
  • a performance of the system for each reduced core count configuration is measured while executing a selected application on the system. In one example, the performance of the system for each reduced core count configuration is measured by measuring the transactions per minute executed by the system for each reduced core count configuration.
  • A cost for each reduced core count configuration is determined. In one example, the cost of the system for each reduced core count configuration is determined based on hardware costs, software costs, power costs, and cooling costs for each reduced core count configuration.
  • an efficiency of each reduced core count configuration is calculated by dividing the measured performance by the determined cost for each reduced core count configuration.
  • a reduced core count configuration based on the calculated efficiency of each reduced core count configuration is selected. In one example, the reduced core count configuration providing the greatest performance for the lowest cost is selected.
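The selection step above reduces to an argmax over efficiency = performance / cost; here is a sketch with illustrative stand-in measurements (the numbers are invented, not from the patent):

```python
# Measured transactions per minute and determined total cost for each
# candidate reduced core count configuration (illustrative values).
measurements = {
    # active cores: (performance in TPM, cost in dollars)
    3:  (30_000.0, 42_000.0),
    6:  (55_000.0, 60_000.0),
    9:  (72_000.0, 78_000.0),
    11: (80_000.0, 95_000.0),
}

# Efficiency is performance divided by cost; select the configuration
# providing the greatest performance for the lowest cost.
efficiency = {k: perf / cost for k, (perf, cost) in measurements.items()}
best = max(efficiency, key=efficiency.get)
print(best)  # the nine-core configuration wins with these numbers
```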
  • the system is configured to execute the selected application using the selected reduced core count configuration.
  • the system is configured by configuring voltage regulator modules of the system to provide the optimum processor clock frequency and the optimum core voltage for the selected reduced core count configuration.
  • Configuring the system may also include configuring an operating system of the system to use the cores of the selected reduced core count configuration.
  • the system may be configured by loading firmware on the system including commands to configure the system to use the selected reduced core count configuration.
  • Figure 5A is a flow diagram illustrating one example of a method 400 for configuring a system based on an application workload.
  • An optimum processor clock frequency and an optimum core voltage are determined for each of a plurality of reduced core count configurations for a system configurable to use up to a set number of cores.
  • a performance of the system is measured for each reduced core count configuration for each of a plurality of different workloads of a selected application executing on the system.
  • a cost for each reduced core count configuration for each of the different workloads is determined.
  • an efficiency of each reduced core count configuration for each of the different workloads is calculated by dividing the measured performance by the determined cost for each reduced core count configuration for each of the different workloads.
  • a first workload from the plurality of different workloads is selected.
  • a first reduced core count configuration based on the calculated efficiency of each reduced core count configuration for the selected first workload is selected.
  • the system is configured to execute the selected application up to the selected first workload using the selected first reduced core count configuration. In one example, the system is configured to execute a database application.
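Methods 400 and 420 amount to keeping a per-workload table of selected configurations and re-applying it when the workload changes; a hypothetical sketch (the workload names and settings are invented for illustration):

```python
# Table built offline from the measured per-workload efficiencies.
best_config_for_workload = {
    "light":  {"active_cores": 3,  "core_mhz": 3600},
    "medium": {"active_cores": 6,  "core_mhz": 3200},
    "heavy":  {"active_cores": 11, "core_mhz": 2800},
}

def configure_for(workload):
    """Return the reduced core count configuration selected for a workload
    (in a real system this would be written into system firmware)."""
    return best_config_for_workload[workload]

initial = configure_for("light")        # method 400: initial deployment
reconfigured = configure_for("heavy")   # method 420: reconfigure in place
```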
  • Figure 5B is a flow diagram illustrating one example of a method 420 for reconfiguring a system based on an application workload.
  • a request to modify the system for a selected second workload from the plurality of workloads is received.
  • a second reduced core count configuration based on the calculated efficiency of each reduced core count configuration for the selected second workload is selected.
  • the system is reconfigured to execute the selected application up to the selected second workload using the selected second reduced core count configuration.
  • Methods 400 and 420 enable a deployed system to be reconfigured in place to accommodate changes to an application workload.
  • the hardware needed to accommodate a greater workload is already in place such that by selecting a different reduced core count configuration, more cores can be activated to handle a greater workload.
  • The hardware vendor and the software vendor may work together on security measures to ensure the customer cannot activate more cores than the number of cores for which the software has been licensed.
  • Figure 6 illustrates an example computer system that can be employed in an operating environment and used to host or run a computer application implementing the example methods 300, 310, 400, and 420, as included on one or more computer readable storage media storing computer executable instructions for controlling the computer system, such as a computing device, to perform a process.
  • the computer system of Figure 6 can be used to implement the analysis engine 224 set forth in system 200.
  • the exemplary computer system of Figure 6 includes a computing device, such as computing device 500.
  • Computing device 500 typically includes one or more processors 502 and memory 504.
  • the processors 502 may include two or more processing cores on a chip or two or more processor chips.
  • the computing device 500 can also have one or more additional processing or specialized processors (not shown), such as a graphics processor for general-purpose computing on graphics processor units, to perform processing functions offloaded from the processors 502.
  • Memory 504 may be arranged in a hierarchy and may include one or more levels of cache. Memory 504 may be volatile (such as random access memory (RAM)), nonvolatile (such as read only memory (ROM), flash memory, etc.), or some combination of the two.
  • the computing device 500 can take one or more of several forms.
  • Such forms include a tablet, a personal computer, a workstation, a server, a handheld device, a consumer electronic device, or other, and can be a stand-alone device or configured as part of a computer network, computer cluster, cloud services infrastructure, or other.
  • Computing device 500 may also include additional storage 508.
  • Storage 508 may be removable and/or non-removable and can include magnetic or optical disks or solid-state memory, or flash storage devices.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any suitable method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. A propagating signal by itself does not qualify as storage media.
  • Computing device 500 often includes one or more input and/or output connections, such as USB connections, display ports, proprietary connections, and others to connect to various devices to receive and/or provide inputs and outputs.
  • Input devices 510 may include devices such as a keyboard, a pointing device (e.g., mouse), a pen, a voice input device, a touch input device, or other.
  • Output devices 512 may include devices such as a display, speakers, a printer, or the like.
  • Computing device 500 often includes one or more communication connections 514 that allow computing device 500 to communicate with other computers/applications 516.
  • Example communication connections can include, but are not limited to, an Ethernet interface, a wireless interface, a bus interface, a storage area network interface, and a proprietary interface.
  • Communication connections can be used to couple the computing device 500 to a computer network 518, which is a collection of computing devices and possibly other devices interconnected by communications channels that facilitate communications and allow sharing of resources and information among interconnected devices.
  • Examples of computer networks include a local area network, a wide area network, the Internet, or other network.
  • Computing device 500 can be configured to run an operating system software program and one or more computer applications, which make up a system platform.
  • a computer application configured to execute on the computing device 500 is typically provided as a set of instructions written in a programming language.
  • a computer application configured to execute on the computing device 500 includes at least one computing process (or computing task), which is an executing program. Each computing process provides the computing resources to execute the program.

Abstract

One example provides a method for configuring a system. The method includes determining an optimum processor clock frequency and an optimum core voltage for each of a plurality of reduced core count configurations for a system configurable to use up to a set number of cores. The method includes selecting a reduced core count configuration from the plurality of reduced core count configurations for an application based on a performance and a cost of each reduced core count configuration while executing the application.

Description

REDUCED CORE COUNT SYSTEM CONFIGURATION
Background
[0001] Over the previous decade, the development of multiple processors and concurrent programming has become an emphasis for scaling the performance of computing systems. The growth of raw sequential processing power has flattened as processor manufacturers have reached roadblocks in providing significant increases to processor clock frequency. Processors continue to evolve, but the current focus for improving processor power is to provide multiple processor cores on a single die to increase processor throughput. Sequential applications, which have previously benefited from increased clock speed, obtain significantly less scaling as the number of processor cores increase. To take advantage of multiple core systems, concurrent (or parallel) applications are written to include concurrent threads distributed over the cores and operating systems have been designed to concurrently operate multiple applications.
Brief Description of the Drawings
[0002] Figure 1 illustrates one example of a multiple core processor.
[0003] Figure 2 illustrates one example of a system for selecting and configuring a reduced core count configuration for the system.
[0004] Figure 3 is a flow diagram illustrating one example of a method for selecting a reduced core count configuration for a system.
[0005] Figure 4 is a flow diagram illustrating another example of a method for selecting a reduced core count configuration for a system.
[0006] Figures 5A-5B are flow diagrams illustrating one example of a method for configuring and reconfiguring a system based on an application workload.
[0007] Figure 6 illustrates one example of a computer system that may be used to implement the example methods.
Detailed Description
[0008] In the following detailed description, reference is made to the
accompanying drawings which form a part hereof, and in which is shown by way of illustration specific examples in which the disclosure may be practiced. It is to be understood that other examples may be utilized and structural or logical changes may be made without departing from the scope of the present disclosure. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims. It is to be understood that features of the various examples described herein may be combined, in part or whole, with each other, unless specifically noted otherwise.
[0009] Symmetric multiprocessing (SMP) systems include a multiprocessor system hardware and software architecture where two or more processors connect to a single, shared main memory and are controlled by a single operating system instance that treats the processors equally, typically reserving none of the processors for special purposes. Many multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors. In one common example, SMP systems are tightly coupled multiprocessor systems with a pool of homogeneous processors running independently, each processor executing different programs and working on different data, with the capability of sharing common resources such as memory, input/output devices, and an interrupt system, and connected together using a system bus or a crossbar.
[0010] An example of a computer application configured to execute in the SMP environment includes at least one process, which is an executing program. Each process provides the resources to execute the program. One or more threads run in the context of the process. A thread is the basic unit to which an operating system allocates time in a processor. The thread is the entity within a process that can be scheduled for execution. Threads of a process can share virtual address space and system resources. Each thread can include exception handlers, a scheduling priority, thread local storage, a thread identifier, and a thread context (or thread state) until the thread is scheduled. In parallel applications, threads can be concurrently executed on multiple processors. Generally, multi-core processors are efficient at executing parallel threads, which is a form of instruction level parallelism.
[0011] Figure 1 illustrates an example multiple core processor 100 that can be implemented in an SMP system to concurrently execute threads or processes. Multiple core processor 100 may be a central processing unit (CPU). This example includes multiple cores implemented on a single die 102. In one example, the multiple core processor 100 is configured as an integrated circuit chip to fit into a socket that provides mechanical and electrical connections with a printed circuit board such as a motherboard on a computing device such as a server.
[0012] In this example, multiple core processor 100 includes four physical processor cores 104, 106, 108, 110, or simply four physical cores, where each of the physical cores is available to process at least one application thread concurrently with at least one other physical core processing another thread. The physical cores 104, 106, 108, 110 are adjacent to a memory controller 112 and a cache 114 on the die 102. Each of the cores is associated with a cache hierarchy. In one example, the architecture of processor 100 includes cache in the physical cores 104, 106, 108, 110 (such as L1 and L2 cache), an L3 cache in cache 114, memory served by memory controller 112, and so on. Caches L1, L2, and L3 in this example can represent on-die memory because they are located on the die 102, whereas the memory hierarchy can further extend to off-die memory. Each location in memory includes a cache line, which can vary in length depending on the processor and memory used. In the example die 102, a queue 116 is disposed on the die between memory controller 112 and cache 114. The die 102 can include other features 118 or combinations of features such as memory interfaces, miscellaneous input/output blocks, proprietary interconnects, expansion card interfaces, and the like.
[0013] In some examples, each physical core may be capable of efficiently and concurrently executing multiple threads of a concurrent process. Such physical cores are often referred to as "Simultaneous Multi-Threading," or simply "SMT," cores, and the concurrently executed threads on each physical core share hardware resources included within the single physical core. In one example of the multiple core processor 100, each physical core is capable of multithreading. Each physical core capable of multithreading can present the operating system with as many logical cores as concurrently executing threads it supports. In one example of multiple core processor 100, each physical core 104, 106, 108, 110 is capable of concurrently executing two threads, and thus provides the operating system with eight concurrent logical cores.
[0014] In some examples, the multiple core processor 100 can be included as part of multiple processor architectures depending upon performance considerations. An example of such architectures includes non-uniform memory access (NUMA), which can be used to scale symmetric multiprocessing (SMP) systems. In a NUMA architecture, the memory access time depends on the memory location relative to the processor, or memory distance, so a processor core can access its own local memory faster than non-local memory such as memory local to another processor or memory shared between processors. NUMA can be used as a tightly coupled form of cluster computing, and virtual memory paging added to a cluster architecture can allow the implementation of NUMA entirely in software. In many instances, multiple core processors in multi-socket servers are homogeneous, but the use of heterogeneous multiple core processors is contemplated.
[0015] Multiple processor architectures can further be combined with other multiple processor architectures in distributed systems. The myriad of available or later developed combinations of logical cores, physical cores, processors, and multiple processor systems that can be used to implement the mechanism is not limited to any particular processing system or architecture. To account for the multiple architectures available for use with this disclosure, each unit capable of concurrently executing a thread or component is generally described here as a "core." In some examples, graphics processing units (GPU) are employed in making intensive calculations as general-purpose computation on graphics processing units (GPGPU). A GPU can also be an example of a multiple core processor.
[0016] Multiple core processors can include more than the two or four cores discussed above, and processor 100 is presented as an example. Accordingly, systems or processors including more cores are contemplated. In one example, a central processing unit available from Intel Corporation and sold under the trade designation Xeon Processor E7-2890 v2, having a 37.5 megabyte cache and a clock speed of 2.8 gigahertz, includes fifteen physical cores and uses approximately 155 watts of power. Each physical core is capable of simultaneously multithreading two threads and thus provides the operating system with thirty logical cores.
[0017] The added cost of additional processors per server to improve performance is a relatively low percentage of the system cost, which many consumers find attractive, but the trend toward raising the number of cores per die or the number of multi-core processors per system has run into some roadblocks. For example, the relatively lower speed of memory, including caches, as compared to processor speed limits instructions per cycle. Also, power consumption in a multi-core processor having a relatively large number of cores tends to limit clock speed, and generally adding cores tends to diminish clock speed altogether. Many processors include a mode for running at a higher clock speed, such as a boost mode that may increase clock speed by 33% or more, but boost mode can be run only for relatively short periods of time due to concerns of overheating and the lack of efficient cooling of the die. Hardware concerns regarding clock speed cause the multi-core processor to rely more on software parallelism for increased performance. Software concerns include Amdahl's Law, which generally states that the time for the sequential fraction of a computer application limits the pace of a program using multiple processors in parallel computing. Also, software engineers have found difficulty in maintaining concurrency in multi-core processors.
[0018] The difficulties of using multi-core processors, or simultaneous processing systems in general, have led to improvements in memory-level parallelism, rather than instruction-level parallelism, as an effective way to overcome these problems. Memory-level parallelism is a term in computer architecture that refers to the ability to have multiple memory operations pending, such as cache misses or translation lookaside buffer misses, at the same time. One example for exploiting memory-level parallelism is aggressive out-of-order execution, which uses large instruction window resources, i.e., the reorder buffer, the issue queue, and the load/store queue. Again, simply enlarging these resources degrades the clock cycle time in multi-processing systems, although other memory-level parallelism schemes are directed to addressing these issues. Examples of types of computer applications that benefit from efficient memory-level parallelism include large databases, database management services, online transaction processing, online analytical processing, and others.
[0019] The total solution cost for a particular computing application is a combination of hardware costs, software costs, power costs, and cooling costs. The primary driver of the total solution cost is typically the software costs (e.g., operating system (OS), database, application, and middleware). The software components are typically priced based primarily on CPU system scale (i.e., physical core count), but their performance is more tightly linked to available physical memory. Adding a marginal core slightly improves the performance but greatly increases the total solution cost due to software licensing. As will be disclosed herein, the total solution cost for an application can be reduced by configuring a system using firmware to maximize the per-core performance. By reducing the number of cores used to implement an application, software licensing costs and therefore total solution cost can be substantially reduced while maintaining sufficient performance.
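The licensing-dominated tradeoff described above can be sketched numerically. The following Python fragment is purely illustrative and not part of the disclosed method: the per-core prices and the diminishing-returns performance model are assumptions, chosen only to show why a marginal core adds little performance relative to its licensing cost.

```python
# Illustrative only: hypothetical per-core costs and a diminishing-returns
# performance model. None of these figures come from the disclosure.
PER_CORE_LICENSE = 5000.0  # assumed software license cost per active core
PER_CORE_HW = 400.0        # assumed marginal hardware/power/cooling cost

def perf(k, c1=1.0, c2=0.02):
    """Performance as a function of active cores, Perf(k) = c1*k - c2*k^2."""
    return c1 * k - c2 * k * k

for k in (4, 8, 12):
    marginal_perf = perf(k + 1) - perf(k)   # benefit of one more core
    marginal_cost = PER_CORE_LICENSE + PER_CORE_HW
    print(f"{k} -> {k + 1} cores: +{marginal_perf:.2f} performance "
          f"for ${marginal_cost:.0f} added cost")
```

Each added core yields a smaller performance gain while the licensing cost per core stays fixed, which is the imbalance the disclosed configuration method exploits.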
[0020] Figure 2 illustrates one example of a system 200 for selecting and configuring a reduced core count configuration for the system. System 200 includes an inter-domain fabric 202, a plurality of domains 204a-204c, system resources 214, and an analysis engine 224. Each domain 204a-204c includes a CPU 206a-206c, a memory 208a-208c, voltage and frequency (V/F) control 210a-210c, and system firmware 212a-212c, respectively. System firmware 212a-212c is communicatively coupled to CPU 206a-206c and memory 208a-208c through a communication path 209a-209c and to V/F control 210a-210c through a communication path 211a-211c, respectively. Each CPU 206a-206c is communicatively coupled to inter-domain fabric 202 through a communication path 203a-203c, respectively. System firmware 212a-212c is communicatively coupled to analysis engine 224 through communication path 213.
[0021] Analysis engine 224 includes system configuration and status 226, a system efficiency table 228, and a system resources list 230. System resources 214 include application performance data 216, power data 218, operating system data 220, and licensing data 222. Application performance data 216 is exchanged with analysis engine 224 through a communication path 217. Power data 218 is exchanged with analysis engine 224 through a communication path 219. Operating system data 220 is exchanged with analysis engine 224 through a communication path 221. Licensing data 222 is exchanged with analysis engine 224 through a communication path 223.
[0022] In one example, inter-domain fabric 202 and domains 204a-204c provide a server. While three domains are illustrated in Figure 2, the disclosure is applicable to any suitable number of domains such as a single domain system or a system having more than three domains. Each CPU 206a-206c may include one or more processor cores, such as four cores as in multiple core processor 100 previously described and illustrated with reference to Figure 1. Each memory 208a-208c includes any suitable combination of volatile and/or non-volatile memory, such as combinations of Random Access Memory (RAM), Read-Only Memory (ROM), flash memory, and/or other suitable memory. Each V/F control 210a-210c controls the operating voltages (e.g., core voltage, memory voltage) and clock frequencies (e.g., core clock frequency, memory clock frequency) for each domain 204a-204c, respectively. System firmware 212a-212c includes commands for configuring each domain 204a-204c including setting the operating voltages and clock frequencies and the number of active cores for each CPU 206a-206c, respectively.
[0023] System resources 214 contains the data used for configuring system 200. Application performance data 216, power data 218, operating system data 220, and licensing data 222 vary depending on the number of active cores. Analysis engine 224 determines the configuration for system 200 based on the system resources data for each reduced core count configuration. System configuration and status 226 indicates the current system 200 configuration and status (i.e., active reduced core count configuration). System efficiency table 228 maintains data collected on the system performance as it relates to core count. System resources list 230 maintains a list of viable reduced core count configurations and their corresponding components available for system 200.
[0024] Analysis engine 224 can modify system firmware 212a-212c to configure each domain 204a-204c, respectively, for testing and deployment. For each reduced core count configuration, analysis engine 224 determines the optimum processor clock frequency and the optimum core voltage for each CPU 206a-206c. For example, for four core CPUs 206a-206c, one reduced core count configuration may include one active core of each CPU 206a-206c such that the reduced core count configuration includes three active cores. In another example, another reduced core count configuration may include two active cores of CPU 206a and one active core of each of CPUs 206b and 206c such that the reduced core count configuration includes four active cores. In this example where each CPU 206a-206c includes four cores, a reduced core count configuration is any configuration of less than 12 cores. The optimum processor clock frequencies and optimum core voltages for each reduced core count configuration may be maintained by system efficiency table 228.
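The enumeration of candidate configurations described in this paragraph can be sketched as follows. The CPU and core counts match the three four-core CPUs 206a-206c of the example, but the code is only an illustrative sketch, assuming at least one active core per CPU:

```python
# Sketch: enumerate candidate reduced core count configurations for three
# four-core CPUs (as in CPUs 206a-206c). A configuration assigns an
# active-core count to each CPU; any total below the full 12 cores is a
# reduced core count configuration. Assumption: each CPU keeps at least
# one active core.
from itertools import product

CORES_PER_CPU = 4
NUM_CPUS = 3

configs = [c for c in product(range(1, CORES_PER_CPU + 1), repeat=NUM_CPUS)
           if sum(c) < CORES_PER_CPU * NUM_CPUS]

print(len(configs), "reduced core count configurations")
```

Both examples from the text appear in the enumeration: `(1, 1, 1)` is the three-active-core configuration and `(2, 1, 1)` is the four-active-core configuration.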
[0025] Analysis engine 224 then measures the performance of system 200 as a function of active processor cores while executing a selected application to be optimized. The measured performance data, power data, operating system data, and licensing data for each reduced core count configuration may be maintained by system resources 214. Based on the measured performance data of each reduced core count configuration, viable reduced core count configurations are determined and may be maintained by system resources list 230. A viable reduced core count configuration may be selected and domains 204a-204c may be configured for deployment to execute the selected application using the selected reduced core count configuration. Domains 204a-204c are configured by updating system firmware 212a-212c, respectively.
[0026] Figure 3 is a flow diagram illustrating one example of a method 300 for selecting a reduced core count configuration for a system, such as system 200 previously described and illustrated with reference to Figure 2. At 302, an optimum processor clock frequency and an optimum core voltage are determined for each of a plurality of reduced core count configurations for a system configurable to use up to a set number of cores. The optimum processor clock frequency and the optimum core voltage for each reduced core count configuration may be determined by lab testing. As the number of active cores decreases, the optimum core voltage and the optimum clock frequency increase. Accordingly, as the number of active cores decreases, the per-core performance increases.
[0027] In one example, the optimum memory clock frequency and the optimum memory voltage for each reduced core count configuration are determined. As the number of active cores decreases, the demand on memory decreases, particularly with increased per-core cache available. Accordingly, as the number of active cores decreases, the memory clock frequency and the memory voltage may be decreased to reduce power costs.
[0028] The cooling cost for each reduced core count configuration may increase as the number of active cores decreases due to increased core clock frequencies and core voltages. Larger heat sinks, active cooling, water cooling, and the like may be used for reduced core count configurations since the added cost for increased cooling is substantially less than the increased software costs when using a greater number of active cores.
[0029] At 304, a reduced core count configuration is selected from the plurality of reduced core count configurations for an application based on a performance and a cost of each reduced core count configuration while executing the application. The performance of the system is measured while running the application to be optimized using each reduced core count configuration. At the same time, the cost of input power to the system and cooling for the system is tracked for each reduced core count configuration. The performance of the system as a function of active cores (k) can be modeled by:
Perf(k) = c1*k - c2*k^2
where c1 and c2 are constants solved from measurements on the system. Perf(k) is monotonic in the usable range with a zero intercept and a slope strictly less than one. In one example, the performance of the system for each reduced core count configuration is determined by measuring transactions per minute executed by the system for each reduced core count configuration.
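Because the model is a quadratic with a zero intercept, two measured (core count, performance) points determine c1 and c2 exactly. The following sketch is illustrative; the measurement values are hypothetical, not data from the disclosure:

```python
# Sketch: solve Perf(k) = c1*k - c2*k^2 for c1 and c2 from two measured
# points (k1, p1) and (k2, p2), e.g., transactions per minute at two
# different active-core counts. The measurement values below are
# hypothetical.
def fit_perf_model(k1, p1, k2, p2):
    # Linear system: p1 = c1*k1 - c2*k1^2,  p2 = c1*k2 - c2*k2^2
    det = k1 * k2 * (k1 - k2)              # determinant of the 2x2 system
    c1 = (p2 * k1 * k1 - p1 * k2 * k2) / det
    c2 = (k1 * p2 - k2 * p1) / det
    return c1, c2

c1, c2 = fit_perf_model(4, 3.68, 8, 6.72)  # hypothetical measurements
print(c1, c2)
```

With more than two measured points, a least-squares fit over all core counts would average out measurement noise; the exact two-point solve above is the minimal case.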
[0030] The cost of the system as a function of active cores (k) is the sum of the hardware (HW) costs, software (SW) costs, power costs, and cooling costs as given by:
Cost(k) = HW + SW(k) + Power(k) + Cooling(k)
[0031] An efficiency (E) of each reduced core count configuration is calculated by dividing the performance by the cost of each reduced core count configuration as given by:
E(k) = Perf(k)/Cost(k)
[0032] The optimal point(s) of operation for the system are determined by solving for k:
dE/dk = 0
(d/dk)(Perf(k)/Cost(k)) = 0
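Because k is a small integer, an implementation need not solve dE/dk = 0 symbolically: it can evaluate E(k) at each feasible core count and take the maximum. The constants below are illustrative assumptions, not values from the disclosure:

```python
# Sketch: find the optimal operating point by scanning
# E(k) = Perf(k)/Cost(k) over feasible integer core counts.
# All cost and performance constants are illustrative assumptions.
def perf(k, c1=1.0, c2=0.02):
    return c1 * k - c2 * k * k

def cost(k, hw=2000.0, sw_per_core=5000.0,
         power_per_core=50.0, cooling_per_core=25.0):
    # Cost(k) = HW + SW(k) + Power(k) + Cooling(k), with the per-core
    # terms modeled linearly for this sketch.
    return hw + (sw_per_core + power_per_core + cooling_per_core) * k

def best_core_count(max_cores):
    return max(range(1, max_cores + 1), key=lambda k: perf(k) / cost(k))

print("optimal core count:", best_core_count(12))
```

With these assumed constants the efficiency peaks at four active cores out of twelve, illustrating how a reduced core count configuration can dominate the full configuration once licensing costs are included.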
[0033] The system may then be configured to execute the application using the selected reduced core count configuration.
[0034] Figure 4 is a flow diagram illustrating another example of a method 310 for selecting a reduced core count configuration for a system, such as system 200 previously described and illustrated with reference to Figure 2. At 312, an optimum processor clock frequency and an optimum core voltage are determined for each of a plurality of reduced core count configurations for a system configurable to use up to a set number of cores. At 314, a performance of the system for each reduced core count configuration is measured while executing a selected application on the system. In one example, the performance of the system for each reduced core count configuration is measured by measuring the transactions per minute executed by the system for each reduced core count configuration.
[0035] At 316, a cost for each reduced core count configuration is determined. In one example, the cost of the system for each reduced core count configuration is determined based on hardware costs, software costs, power costs, and cooling costs for each reduced core count configuration. At 318, an efficiency of each reduced core count configuration is calculated by dividing the measured performance by the determined cost for each reduced core count configuration. At 320, a reduced core count configuration is selected based on the calculated efficiency of each reduced core count configuration. In one example, the reduced core count configuration providing the greatest performance for the lowest cost is selected.
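The measurement, cost-determination, and selection steps of method 310 reduce to a small computation once the data are collected. The configuration names, performance figures (e.g., transactions per minute), and costs below are hypothetical placeholders:

```python
# Hypothetical sketch of method 310's selection step: given measured
# performance and determined cost for each candidate reduced core count
# configuration, pick the configuration with the best efficiency.
measurements = {   # configuration name -> (measured_perf, determined_cost)
    "3-core": (30000.0, 18000.0),
    "4-core": (38000.0, 23000.0),
    "6-core": (50000.0, 33000.0),
}

efficiency = {name: p / c for name, (p, c) in measurements.items()}
selected = max(efficiency, key=efficiency.get)
print("selected configuration:", selected)
```

Here the 3-core configuration wins despite its lower absolute throughput, because its efficiency (performance per unit cost) is highest.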
[0036] At 322, the system is configured to execute the selected application using the selected reduced core count configuration. In one example, the system is configured by configuring voltage regulator modules of the system to provide the optimum processor clock frequency and the optimum core voltage for the selected reduced core count configuration. Configuring the system may also include configuring an operating system of the system to use the cores of the selected reduced core count configuration. The system may be configured by loading firmware on the system including commands to configure the system to use the selected reduced core count configuration.
[0037] Figure 5A is a flow diagram illustrating one example of a method 400 for configuring a system based on an application workload. At 402, an optimum processor clock frequency and an optimum core voltage are determined for each of a plurality of reduced core count configurations for a system configurable to use up to a set number of cores. At 404, a performance of the system is measured for each reduced core count configuration for each of a plurality of different workloads of a selected application executing on the system. At 406, a cost for each reduced core count configuration for each of the different workloads is determined. At 408, an efficiency of each reduced core count configuration for each of the different workloads is calculated by dividing the measured performance by the determined cost for each reduced core count configuration for each of the different workloads. At 410, a first workload from the plurality of different workloads is selected. At 412, a first reduced core count configuration is selected based on the calculated efficiency of each reduced core count configuration for the selected first workload. At 414, the system is configured to execute the selected application up to the selected first workload using the selected first reduced core count configuration. In one example, the system is configured to execute a database application.
[0038] Figure 5B is a flow diagram illustrating one example of a method 420 for reconfiguring a system based on an application workload. At 422, a request to modify the system for a selected second workload from the plurality of workloads is received. At 424, a second reduced core count configuration based on the calculated efficiency of each reduced core count configuration for the selected second workload is selected. At 426, the system is reconfigured to execute the selected application up to the selected second workload using the selected second reduced core count configuration.
[0039] Methods 400 and 420 enable a deployed system to be reconfigured in place to accommodate changes to an application workload. The hardware needed to accommodate a greater workload is already in place, such that by selecting a different reduced core count configuration, more cores can be activated to handle a greater workload. The hardware vendor and the software vendor may work together on security measures to ensure the customer cannot activate more cores than the number of cores for which the software has been licensed.
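The licensing guard mentioned above can be sketched as a simple check performed before any reconfiguration takes effect. The function name, arguments, and error type below are hypothetical, not part of the disclosed firmware interface:

```python
# Hypothetical sketch of the in-place reconfiguration step with the
# licensing guard: the system refuses to activate more cores than the
# software license allows. In a real system, the final step would update
# system firmware and voltage/frequency settings for the newly selected
# reduced core count configuration.
def reconfigure(current_active, requested_active, licensed_cores):
    if requested_active > licensed_cores:
        raise PermissionError("requested cores exceed licensed core count")
    # Firmware/V-F update would happen here in a real deployment.
    return requested_active

print(reconfigure(3, 6, licensed_cores=8))
```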
[0040] Figure 6 illustrates an example computer system that can be employed in an operating environment and used to host or run a computer application implementing the example methods 300, 310, 400, and 420 as included on one or more computer readable storage mediums storing computer executable instructions for controlling the computer system, such as a computing device, to perform a process. In one example, the computer system of Figure 6 can be used to implement the analysis engine 224 set forth in system 200.
[0041] The exemplary computer system of Figure 6 includes a computing device, such as computing device 500. Computing device 500 typically includes one or more processors 502 and memory 504. The processors 502 may include two or more processing cores on a chip or two or more processor chips. In some examples, the computing device 500 can also have one or more additional processing or specialized processors (not shown), such as a graphics processor for general-purpose computing on graphics processing units, to perform processing functions offloaded from the processors 502. Memory 504 may be arranged in a hierarchy and may include one or more levels of cache. Memory 504 may be volatile (such as random access memory (RAM)), nonvolatile (such as read only memory (ROM), flash memory, etc.), or some combination of the two. The computing device 500 can take one or more of several forms. Such forms include a tablet, a personal computer, a workstation, a server, a handheld device, a consumer electronic device, or other, and can be a stand-alone device or configured as part of a computer network, computer cluster, cloud services infrastructure, or other.
[0042] Computing device 500 may also include additional storage 508. Storage 508 may be removable and/or non-removable and can include magnetic or optical disks or solid-state memory, or flash storage devices. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any suitable method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. A propagating signal by itself does not qualify as storage media.
[0043] Computing device 500 often includes one or more input and/or output connections, such as USB connections, display ports, proprietary connections, and others to connect to various devices to receive and/or provide inputs and outputs. Input devices 510 may include devices such as a keyboard, a pointing device (e.g., mouse), a pen, a voice input device, a touch input device, or other. Output devices 512 may include devices such as a display, speakers, a printer, or the like. Computing device 500 often includes one or more communication connections 514 that allow computing device 500 to communicate with other computers/applications 516. Example communication connections can include, but are not limited to, an Ethernet interface, a wireless interface, a bus interface, a storage area network interface, and a proprietary interface. The communication connections can be used to couple the computing device 500 to a computer network 518, which is a collection of computing devices and possibly other devices interconnected by communications channels that facilitate communications and allow sharing of resources and information among interconnected devices. Examples of computer networks include a local area network, a wide area network, the Internet, or other networks.
[0044] Computing device 500 can be configured to run an operating system software program and one or more computer applications, which make up a system platform. A computer application configured to execute on the computing device 500 is typically provided as a set of instructions written in a programming language. A computer application configured to execute on the computing device 500 includes at least one computing process (or computing task), which is an executing program. Each computing process provides the computing resources to execute the program.
[0045] Although specific examples have been illustrated and described herein, a variety of alternate and/or equivalent implementations may be substituted for the specific examples shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific examples discussed herein. Therefore, it is intended that this disclosure be limited only by the claims and the equivalents thereof.

Claims

1. A system comprising:
a processor; and
a memory communicatively coupled to the processor, the memory storing instructions executable by the processor to:
determine an optimum processor clock frequency and an optimum core voltage for each of a plurality of reduced core count configurations for a system configurable to use up to a set number of cores; and
select a reduced core count configuration from the plurality of reduced core count configurations for an application based on a performance and a cost of each reduced core count configuration while executing the application.
2. The system of claim 1, wherein the memory stores instructions executable by the processor to further:
configure the system to execute the application using the selected reduced core count configuration.
3. The system of claim 1, wherein the instructions to select the reduced core count configuration are executable by the processor to further:
measure a performance of the system for each reduced core count configuration while executing the application;
determine a cost for each reduced core count configuration;
calculate an efficiency for each reduced core count configuration by dividing the measured performance by the determined cost for each reduced core count configuration; and
select the reduced core count configuration for the application based on the calculated efficiency of each reduced core count configuration.
4. The system of claim 3, wherein the instructions to measure the performance of the system for each reduced core count configuration comprise instructions to measure transactions per minute executed by the system for each reduced core count configuration.
5. The system of claim 3, wherein the instructions to determine the cost for each reduced core count configuration comprise instructions to sum a hardware cost, a software cost, a power cost, and a cooling cost for each reduced core count configuration.
6. The system of claim 1, wherein the memory stores instructions executable by the processor to further:
determine an optimum memory clock frequency and an optimum memory voltage for each reduced core count configuration.
7. A method comprising:
determining an optimum processor clock frequency and an optimum core voltage for each of a plurality of reduced core count configurations for a system configurable to use up to a set number of cores;
measuring a performance of the system for each reduced core count configuration while executing a selected application on the system;
determining a cost for each reduced core count configuration;
calculating an efficiency of each reduced core count configuration by dividing the measured performance by the determined cost for each reduced core count configuration;
selecting a reduced core count configuration based on the calculated efficiency of each reduced core count configuration; and
configuring the system to execute the selected application using the selected reduced core count configuration.
8. The method of claim 7, wherein measuring the performance of the system for each reduced core count configuration comprises measuring transactions per minute executed by the system for each reduced core count configuration.
9. The method of claim 7, wherein determining the cost of the system for each reduced core count configuration comprises determining the cost for each reduced core count configuration based on hardware costs, software costs, power costs, and cooling costs.
10. The method of claim 7, wherein configuring the system comprises configuring voltage regulator modules of the system to provide the optimum processor clock frequency and the optimum core voltage for the selected reduced core count configuration.
11. The method of claim 7, wherein configuring the system comprises configuring an operating system of the system to use the cores of the selected reduced core count configuration.
12. The method of claim 7, wherein configuring the system comprises loading firmware on the system comprising commands to configure the system to the selected reduced core count configuration.
13. A computer readable storage medium storing computer executable instructions for controlling a computer system to perform a process comprising:
determining an optimum processor clock frequency and an optimum core voltage for each of a plurality of reduced core count configurations for a system configurable to use up to a set number of cores;
measuring a performance of the system for each reduced core count configuration for each of a plurality of different workloads of a selected application executing on the system;
determining a cost for each reduced core count configuration for each of the different workloads;
calculating an efficiency of each reduced core count configuration for each of the different workloads by dividing the measured performance by the determined cost for each reduced core count configuration for each of the different workloads;
selecting a first workload from the plurality of different workloads;
selecting a first reduced core count configuration based on the calculated efficiency of each reduced core count configuration for the selected first workload; and
configuring the system to execute the selected application up to the selected first workload using the selected first reduced core count configuration.
14. The computer readable storage medium of claim 13, wherein the process further comprises:
receiving a request to modify the system for a selected second workload from the plurality of workloads;
selecting a second reduced core count configuration based on the calculated efficiency of each reduced core count configuration for the selected second workload; and
reconfiguring the system to execute the selected application up to the selected second workload using the selected second reduced core count configuration.
15. The computer readable storage medium of claim 13, wherein configuring the system to execute the selected application up to the selected first workload using the selected first reduced core count configuration comprises configuring the system to execute a database application.
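The selection method recited in claims 7 through 9 can be illustrated with a minimal sketch: for each reduced core count configuration, measure performance (e.g., transactions per minute), sum the hardware, software, power, and cooling costs, divide performance by total cost to get an efficiency, and select the configuration with the highest efficiency. All class names, field names, and sample figures below are hypothetical illustrations, not part of the application.

```python
# Hypothetical sketch of the efficiency-based selection described in
# claims 7-9: efficiency = measured performance / total cost,
# where total cost sums hardware, software, power, and cooling costs.
from dataclasses import dataclass


@dataclass
class ConfigResult:
    core_count: int               # cores enabled in this reduced configuration
    transactions_per_min: float   # measured performance for the workload
    hardware_cost: float
    software_cost: float
    power_cost: float
    cooling_cost: float

    @property
    def total_cost(self) -> float:
        # Claim 9: cost based on hardware, software, power, and cooling costs.
        return (self.hardware_cost + self.software_cost
                + self.power_cost + self.cooling_cost)

    @property
    def efficiency(self) -> float:
        # Claim 7: efficiency = measured performance / determined cost.
        return self.transactions_per_min / self.total_cost


def select_configuration(results: list[ConfigResult]) -> ConfigResult:
    """Select the reduced core count configuration with the highest efficiency."""
    return max(results, key=lambda r: r.efficiency)
```

Under this sketch, a configuration with fewer cores can win when its lower combined costs outweigh its lower raw throughput, which is the trade-off the claimed method is designed to expose.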
PCT/US2015/011358 2015-01-14 2015-01-14 Reduced core count system configuration WO2016114771A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2015/011358 WO2016114771A1 (en) 2015-01-14 2015-01-14 Reduced core count system configuration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2015/011358 WO2016114771A1 (en) 2015-01-14 2015-01-14 Reduced core count system configuration

Publications (1)

Publication Number Publication Date
WO2016114771A1 true WO2016114771A1 (en) 2016-07-21

Family

ID=56406170

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/011358 WO2016114771A1 (en) 2015-01-14 2015-01-14 Reduced core count system configuration

Country Status (1)

Country Link
WO (1) WO2016114771A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220050718A1 (en) * 2020-08-12 2022-02-17 Core Scientific, Inc. Scalability advisor

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130205126A1 (en) * 2012-02-04 2013-08-08 Empire Technology Development Llc Core-level dynamic voltage and frequency scaling in a chip multiprocessor
US20140181538A1 (en) * 2012-12-21 2014-06-26 Jeremy J. Shrall Controlling Configurable Peak Performance Limits Of A Processor
US20140189240A1 (en) * 2012-12-29 2014-07-03 David Keppel Apparatus and Method For Reduced Core Entry Into A Power State Having A Powered Down Core Cache


Similar Documents

Publication Publication Date Title
TWI439941B (en) Method, apparatus, and multi-core processor system for automatic workload distribution within multiprocessor system
Pattnaik et al. Scheduling techniques for GPU architectures with processing-in-memory capabilities
Mittal et al. A survey of methods for analyzing and improving GPU energy efficiency
Wang et al. Workload and power budget partitioning for single-chip heterogeneous processors
US8489904B2 (en) Allocating computing system power levels responsive to service level agreements
KR101137215B1 (en) Mulicore processor and method of use that configures core functions based on executing instructions
Wang et al. OPTiC: Optimizing collaborative CPU–GPU computing on mobile devices with thermal constraints
Tsai et al. Adaptive scheduling for systems with asymmetric memory hierarchies
Gschwandtner et al. Performance analysis and benchmarking of the intel scc
TW201337771A (en) A method, apparatus, and system for energy efficiency and energy conservation including thread consolidation
JP2013218721A (en) Method and apparatus for varying energy per instruction according to amount of available parallelism
US9811385B2 (en) Optimizing task management
WO2012170214A2 (en) System and apparatus for modeling processor workloads using virtual pulse chains
US11853787B2 (en) Dynamic platform feature tuning based on virtual machine runtime requirements
CN112346557B (en) Multi-core system and control operation thereof
Schwarzrock et al. Effective exploration of thread throttling and thread/page mapping on numa systems
Ma et al. Energy conservation for GPU–CPU architectures with dynamic workload division and frequency scaling
Chiang et al. Kernel mechanisms with dynamic task-aware scheduling to reduce resource contention in NUMA multi-core systems
KR20210007417A (en) Multi-core system and controlling operation of the same
Chen et al. Increasing off-chip bandwidth in multi-core processors with switchable pins
Chen et al. Automatic cache partitioning and time-triggered scheduling for real-time MPSoCs
WO2016114771A1 (en) Reduced core count system configuration
Haldeman et al. Exploring energy-performance-quality tradeoffs for scientific workflows with in-situ data analyses
Tasoulas et al. Performance and aging aware resource allocation for concurrent GPU applications under process variation
Rossi et al. Green software development for multi-core architectures

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15878213

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15878213

Country of ref document: EP

Kind code of ref document: A1