Monitoring heterogeneous architectures

One of the main objectives of the TANGO Project is to optimize the energy usage of applications in a heterogeneous environment (by heterogeneous we mean a mixture of different processing devices, such as CPUs, GPUs, FPGAs, DSPs, and so on). One of the main challenges is monitoring the energy usage of those devices without intrusive measurements, such as attaching additional physical probes.

Luckily for us, hardware manufacturers are also concerned about this problem, and we can build on technologies already available in different types of processors. In this article we go over different tools that we could use to build the energy monitoring solution for TANGO.

Monitoring the whole system:

  • External monitors – In HPC environments in particular, it is common to install external power meters to measure the consumption of each node. There are several manufacturers; to give just one example, we can point to the Watt’s Up Pro power meters: https://www.wattsupmeters.com/secure/products.php?pn=0 .
  • IPMI – The Intelligent Platform Management Interface (https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface) is an industry standard that allows physical hosts to be managed and monitored independently of the host operating system. It is now widely supported by manufacturers such as Bull, Dell, IBM, HP… Its latest versions allow monitoring the energy consumption of the whole system.
  • DS-5 – For embedded systems based on ARM processors, ARM itself offers the ARM DS-5 Development Studio (https://developer.arm.com/products/software-development-tools/ds-5-devel...), which provides energy monitoring for the whole main board of the system.
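As an illustration of the IPMI route, the sketch below parses the text printed by the command `ipmitool dcmi power reading`. It is a minimal example, not part of TANGO: it assumes ipmitool is installed, that the node's management controller supports the DCMI power reading extension, and that the caller has the privileges ipmitool needs. The function names and the sample numbers are hypothetical.

```python
import re
import subprocess

# Sample text in the shape printed by `ipmitool dcmi power reading`.
# The numbers are made up for illustration.
SAMPLE_OUTPUT = """
    Instantaneous power reading:                   220 Watts
    Minimum during sampling period:                 16 Watts
    Maximum during sampling period:                454 Watts
    Average power reading over sample period:      218 Watts
    Power reading state is:                   activated
"""


def parse_power_reading(text):
    """Return the instantaneous node power in watts, or None if absent."""
    match = re.search(r"Instantaneous power reading:\s*(\d+)\s*Watts", text)
    return int(match.group(1)) if match else None


def read_node_power():
    """Query the local management controller (assumes DCMI support)."""
    result = subprocess.run(
        ["ipmitool", "dcmi", "power", "reading"],
        capture_output=True, text=True, check=True,
    )
    return parse_power_reading(result.stdout)


if __name__ == "__main__":
    # Parse the canned sample so the example runs without IPMI hardware.
    print(parse_power_reading(SAMPLE_OUTPUT))
```

On a real node, `read_node_power()` would replace the canned sample. The value is a whole-node figure, so it cannot be attributed to an individual processor.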

Energy measurements internal to the processor:

What is more interesting for TANGO, however, is being able to take measurements inside the processor itself, which is possible depending on the processor. Here are some interesting tools available today:

  • Intel RAPL – The Running Average Power Limit (blogs/tlcounts/2014/running-average-power-limit-–-rapl) interface was introduced by Intel with its Sandy Bridge processors and is present in all current x86-based Intel processors. Through processor counters, RAPL makes it possible to read how much energy is being consumed by all the cores, by the GPU, or by the DRAM associated with the processor package (which of these are available depends on the processor family).
  • AMD Power Management – Similarly to Intel, starting with its Family 15h x86 processors (https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&... ), AMD offers the possibility of reading how much energy is being used by the whole package.
  • NVIDIA Management Library (NVML) – For some of the GPUs designed and produced by NVIDIA, it is possible to use a library known as the NVIDIA Management Library (https://developer.nvidia.com/nvidia-management-library-nvml ). This library gives access to how much energy the whole GPU package is using.
  • Intel Xeon Phi – For its many-core processor architecture known as Xeon Phi (http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html ), Intel offers a System Management Controller (SMC). Through this SMC it is possible to read the total energy usage of this type of processor (global energy for all the cores in the package).
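To make the counter-based approach more concrete, here is a minimal sketch of how RAPL energy counters can be turned into an average power figure. It assumes a Linux host that exposes RAPL through the powercap sysfs interface at `/sys/class/powercap/intel-rapl:0`. That path, and the helper names, are assumptions for illustration, and reading the files may require elevated privileges. The counter is cumulative and wraps around, so the difference between two reads has to account for that.

```python
import time
from pathlib import Path

# Linux powercap directory for the first CPU package (assumed to exist).
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def energy_delta_uj(before, after, max_range):
    """Microjoules consumed between two reads, allowing one wraparound."""
    if after >= before:
        return after - before
    # The counter wrapped: add what was left before the wrap to the new value.
    return (max_range - before) + after


def average_power_watts(before, after, max_range, seconds):
    """Average power over the interval, from two cumulative energy reads."""
    return energy_delta_uj(before, after, max_range) / 1e6 / seconds


def sample_package_power(interval=1.0):
    """Read the package energy counter twice and return average watts."""
    def read(name):
        return int((RAPL_DIR / name).read_text())

    max_range = read("max_energy_range_uj")
    t0, e0 = time.monotonic(), read("energy_uj")
    time.sleep(interval)
    t1, e1 = time.monotonic(), read("energy_uj")
    return average_power_watts(e0, e1, max_range, t1 - t0)
```

Calling `sample_package_power(1.0)` on a supported machine would return the average package power over roughly one second. The two pure helpers can be used with counters obtained in any other way.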

The most important advantage is that, for those processors, the energy values can be read through the de facto standard for accessing processor counters: PAPI. Version 5 is able to read the energy counters described above for all those processors: http://icl.cs.utk.edu/news_pub/submissions/ispass2013_papi.pdf .

As you can see, the previous examples leave out FPGAs. Covering them requires studying each FPGA coprocessor board provided by the different manufacturers, together with the proprietary software that comes with the board. This is planned for the future as part of one of the use cases in the TANGO project.

The monitoring solution produced by the TANGO project is planned to be integrated into both SLURM (http://slurm.schedmd.com/ ) and the CollectD monitoring system (https://collectd.org/ ).
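As a rough idea of what feeding a measurement to CollectD can look like, the sketch below formats a power reading as a `PUTVAL` line of CollectD's plain text protocol, which is what a script run by the Exec plugin writes to standard output. This is only an assumption about one possible integration path, not the TANGO implementation. The plugin name `tango_energy`, the type instance `node`, and the use of the `power` type (assumed to be defined in the local types.db) are hypothetical.

```python
import socket


def putval_line(watts, host=None, plugin="tango_energy",
                type_="power", type_instance="node", interval=10):
    """Format one reading as a collectd plain text protocol PUTVAL line."""
    host = host or socket.gethostname()
    # Identifier layout: host/plugin/type-type_instance
    identifier = f"{host}/{plugin}/{type_}-{type_instance}"
    # "N" asks collectd to timestamp the value with the current time.
    return f'PUTVAL "{identifier}" interval={interval} N:{watts}'


if __name__ == "__main__":
    # A script launched by the Exec plugin would print a line like this
    # once per interval; flush so collectd sees it immediately.
    print(putval_line(220, host="node01"), flush=True)
```

A real collector would call a measurement function in a loop and print one line per interval instead of the fixed value used here.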