Apr 26, 2013
 

This week NVIDIA provided a tutorial outlining first steps for GPU acceleration using OpenACC and CUDA. This was offered as part of the “GPUs Accelerating Research” week at Northeastern University and Boston University. After attending, we thought it appropriate to review the material and boil it down to the essentials. There’s no way we can cover everything, but sometimes getting started is the hardest part.

Set aside half an hour, four hours, eight hours; whatever you can spare. The steps below will get your code up and running on GPUs. The resources at the end will be necessary for those seriously digging into acceleration.

You very likely have a decent GPU in an existing laptop, workstation or server. While NVIDIA Tesla GPUs are typically best for computation, a GeForce or Quadro card will be fine for your first development attempts. You are welcome to contact us for a GPU Test Drive on Tesla once you have working code and want to see the speedups available from professional cards.

Continue reading »

Apr 22, 2013
 

The Intel® Xeon® Phi™ is the first x86-based accelerator for heterogeneous HPC environments. A parallel coprocessor with 60 x86 compute cores, the Xeon Phi 5110P delivers over 1 TFLOPS of double-precision floating-point performance. Built on Intel’s Many Integrated Core (MIC) architecture, Xeon Phi is used to scale out traditional HPC cluster designs. The passively cooled Xeon Phi 5110P we are shipping today offers a great balance of ease of use, performance, energy efficiency and cost in a solution optimized for highly parallel HPC workloads.

Phi’s primary advantage is ease of use. Being based upon Intel’s underlying x86 architecture, it is at once familiar and flexible. Many common parallel programming standards are supported, so the HPC community will find it easy to port and optimize existing programs to achieve much greater performance levels. Intel’s MIC Architecture programming models are open-standard and portable between Xeon processors and Xeon Phi coprocessors. Using widely available Intel libraries, compilers and debuggers will accelerate the evolution of programs to run on Xeon / Xeon Phi stacks.

Buzz is already building for Phi in many demanding High Performance Computing applications. Homa Karimabadi, a Space Physics Group Leader at the University of California, San Diego, has shared that “we have really enjoyed exploring and testing the performance capabilities of the Intel Xeon Phi coprocessor. The integrated Intel tool chain allowed us to take code written for Intel Xeon processors and execute on the coprocessors with minimal to no changes – We have seen impressive results on large matrix tests and it’s clear that the compute capabilities have jumped. We look forward to working with Intel and exploring this technology further.”

PERFORMANCE

Phi is ideal for:

  • High density environments
  • Highly parallel applications using over 100 threads
  • Memory bandwidth‐bound applications
  • Applications with extensive vector use

Xeon Phi 5110P Key specifications:

  • 60 cores/1.053 GHz/240 threads
  • 8 GB memory and 320 GB/s bandwidth
  • Standard PCIe* x16 2.0 device, passively cooled
  • Linux* operating system, IP addressable
  • Wide 512‐bit vector units
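
The “over 1 TFLOPS” figure can be sanity-checked from the specifications above. The sketch below is a back-of-the-envelope estimate, not Intel’s published methodology. It assumes each core can issue one 512-bit fused multiply-add per clock cycle (our assumption, not something stated in the spec list), which works out to 16 double-precision operations per core per cycle. The function name `peak_dp_gflops` is hypothetical and used only for illustration.

```python
def peak_dp_gflops(cores, clock_ghz, vector_bits=512, fma=True):
    """Estimate theoretical peak double-precision GFLOPS.

    Assumption (not from the spec list): every core issues one full-width
    vector instruction per cycle, and a fused multiply-add counts as 2 FLOPs.
    """
    lanes = vector_bits // 64                   # 64-bit doubles per vector register
    flops_per_cycle = lanes * (2 if fma else 1)
    return cores * clock_ghz * flops_per_cycle  # GHz x FLOPs/cycle = GFLOPS


if __name__ == "__main__":
    # Xeon Phi 5110P figures from the list above: 60 cores at 1.053 GHz
    estimate = peak_dp_gflops(cores=60, clock_ghz=1.053)
    print(f"Estimated peak: {estimate:.1f} GFLOPS")
```

Under these assumptions the estimate comes out to roughly 1,011 GFLOPS, which is consistent with the “over 1 TFLOPS” claim. Real applications will see less, since peak numbers assume every vector lane is busy on every cycle.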

Built using Intel’s 22nm process technology (Intel’s most energy-efficient process yet), featuring the world’s first 3-D tri-gate transistors, Phi provides the opportunity to add raw processing power at large scale while still living within many existing power constraints. A host of tools from independent vendors, up to and including IBM, are coming soon that will help developers manage parallelization and distribute workloads across processors and cores, optimizing performance and minimizing wasted clock cycles (and energy).

Intel has provided the following overview and application information for reference:

Intel Xeon Phi Coprocessor 5110P

Intel Xeon Phi Coprocessor Increases Application Performance up to 10x

Intel claims to have reached beyond 1 TFLOPS of double precision peak performance in specific applications with Phi, the highest parallel performance per watt of any Xeon processor. The advent of a coprocessor that can also host an operating system opens up interesting new possibilities such as offloading serial tasks back to the host, and further maximizing compute resources. Microway is excited to be shipping Intel Xeon Phi based servers today. Xeon Phi based WhisperStations will ship in Q3 of 2013.

At Microway, We Speak HPC™, and we speak Intel Xeon Phi. Talk to a Microway advisor about how Intel Xeon Phi running in our NumberSmasher Servers may provide you an easier and lower cost roadmap to scalable performance.

Contact us at WeSpeakHPC@Microway.com.

Mar 22, 2013
 

You may have heard the buzz about Nvidia’s Maximus technology. While the buzz has been around for a little while, the latest product release has been worth the wait. The essence of Maximus is the unification of the driver base for Tesla GPUs and Quadro GPUs. With Maximus, the assignment of CUDA work to devices becomes transparent to both developers and users. The upshot is that Maximus splits workloads in two: compute (simulation and rendering) goes to the Tesla GPU card(s), and design (visualization) goes to the Quadro GPU card(s).

By embedding this functionality at the driver level, Nvidia makes the world simpler for both high performance computing developers and HPC users. More importantly, Maximus enables new workflow models that can dramatically reduce the time to iterate new designs and even enables restructuring of your design and simulation departments. Nvidia provides a handy graphic to represent the advantages of their new technology, shown here.

Nvidia Maximus Workstation vs Traditional Workstation Chart

The ability to perform simultaneous CAE, rendering or multi-physics analysis such as structural dynamics, thermal analysis or computational fluid dynamics on the same system being used for design work enables productivity breakthroughs. In the past, the development process was dictated by the technological constraints of the workstation, whether one is developing a consumer product or a hit movie. Now, the process of design, development and production can be optimized by working in the most natural or efficient way. The design of workflows, and even departments and divisions that used to be driven by the split between design and simulation can now be rethought and retooled.

Because of faster design / simulation cycles, the result is faster time to value, with end products that are of higher quality reaching the market sooner.

With 2nd generation Maximus, performance per watt is significantly increased. Compared to prior designs, the new Kepler-based GPU cards feature a 3X performance per watt improvement. With this unparalleled leap in performance per watt, Nvidia is now enabling a workstation roadmap that has the potential to disrupt conventional thinking about how design, simulation and rendering can, and should, be accomplished.

Nvidia sees the ‘age of the hybrid designer’ coming: a designer who designs and simulates in real time within the same application. Industries such as manufacturing design, energy, and media and entertainment will be leveraging distributed, workstation-based supercomputing performance on currently available applications. We believe that ultimately a wide range of advanced applications that benefit from high-quality visualization and heavy computational requirements will be developed. New, deeply interactive design and engineering processes will enable the evaluation of more design variations in less time. We anticipate that the result will be higher-quality products, more product innovation, and higher-efficiency products developed in less time, with fewer delays and unanticipated flaws.

We’re now in a ‘future’ where complex animations and film scenes can be edited in real-time, even with complex ‘physics’ such as rain and wind. Where energy companies can interpret complex data sets and produce high-resolution visualizations of geological formations in record time. Where HPC professionals don’t have to schedule jobs to run at night or during lunch breaks to manage compute workloads. Nvidia Maximus will reduce or eliminate the processing bottlenecks that are the inevitable result of using shared compute resources. This will produce nothing short of a revolution in design and engineering speed, efficiency and productivity.

At Microway, We Speak HPC™, and we speak Maximus. Talk to a Microway advisor about how Maximus technology running on our WhisperStation™ can revolutionize the return on your engineering workstation investment. Contact us at WeSpeakHPC@Microway.com.

 

Feb 13, 2013
 

In 2004 Google released a white paper on their use of the MapReduce framework to perform fast and reliable execution of similar processes, data transformations and queries at terabyte scale. Yahoo then began the Hadoop project to support their search product. As a result, Apache elevated Hadoop, its MapReduce and DFS (distributed file system) initiative, out of Nutch, its open source search project.

Although technically Hadoop is still in pre-release 1.0, it has proven to be stable and useful for Big Data web 2.0 applications. When you are using Google, LinkedIn, Facebook, Twitter and Yahoo! you are running on Hadoop.

What about Hadoop for High Performance Computing with scientific applications? It certainly has its place, and a basic understanding of Hadoop helps you see where you can take advantage of it in HPC.

Firstly, what is MapReduce? MapReduce is a methodology for performing parallel computations on very large volumes of data by dividing the workload across a large number of similar machines, called ‘nodes’. The MapReduce methodology enables linear scalability through good data and file management. Additionally, MapReduce differs from other methodologies in that it relies on nodes which are servers with attendant disk storage. Work is allocated to these storage server nodes based upon where the data is, as opposed to moving data to where processing occurs. This dramatically accelerates applications which process Big Data sets.

With MapReduce, you ‘map’ your input data to the type of output you desire using some function that is replicable: for instance, substituting a space for every comma in the input strings, or counting the number of occurrences of each word in a book. ‘Reducing’ aggregates the mapped data together into useful results, perhaps through functions such as addition and subtraction.
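
To make the word-counting example concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps in plain Python. It does not use Hadoop or any Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. In a real cluster the framework performs the shuffle and runs the map and reduce functions on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: aggregate all counts for one word by addition."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    book = ["the quick brown fox", "the lazy dog", "The end"]
    print(word_count(book))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` depends only on one key, both steps can be spread across many nodes without coordination. That independence is what lets the approach scale.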

Much like RedHat with Linux, there are now commercial releases of Hadoop such as Cloudera that provide tools to simplify Hadoop implementation as well as reliable technical support. Hadoop itself provides built-in fault tolerance through triplicate copies of data distributed across processing nodes, enabling a robust implementation ‘out of the box’. Whereas GPFS and Lustre have scaled across hundreds of servers, known Hadoop implementations have successfully scaled across tens of thousands of nodes.

So what does all this mean for HPC, scientific and engineering applications? Microway sees Hadoop as an excellent addition to the stack for data intensive scientific applications. This can include bioinformatics, physics and weather modeling applications. Hadoop can also accelerate science when the workloads include a series of queries of very large data sets. Additionally, when scaling science from the desktop up to larger workloads, Hadoop can provide an effective transition model.

A few examples of Microway Hadoop solutions include the NumberSmasher 1U, 2U and 4U servers. With one to four multi-core Xeon CPUs, 512GB memory and up to 120TB storage, the NumberSmasher servers are flexible and cost-effective. Microway will build your cluster for you – whether it’s four nodes or a hundred nodes.

We speak HPC, and we speak Hadoop! To learn more about how Hadoop can accelerate your science and engineering workloads feel free to reach a specialist at wespeakhpc@microway.com.


Sep 07, 2012
 

NVIDIA is now shipping their 4.58 TFLOPS single-precision floating-point GPUs. The Tesla K10 GPU Accelerators, based upon the Kepler GK104 architecture, are the first Teslas available from this new generation of products. They are designed for single-precision floating-point applications, so double-precision users will need to wait until winter for the Tesla K20 GPU Accelerators.

Although Tesla K10 is weak in double-precision performance, many users will find it has a lot to offer:

  • PCI-Express x16 generation 3.0 link
  • Dual GK104 GPUs, each with 4GB GDDR5 Memory
  • 8GB Total Memory (2 x 160 GB/sec peak bandwidth)
  • 4.58 TFLOPS peak single-precision floating point
  • 190 GFLOPS peak double-precision floating point

Continue reading »

May 16, 2012
 

Compute performance has been exponentially increasing for the entirety of your life – it doesn’t matter what your age is. This week at NVIDIA’s GTC 2012 conference, we’ve seen that GPUs are still leading the charge. The new NVIDIA “Kepler” K10 and K20 GPU Accelerators will be offering 4.58 TFLOPS single-precision and over 1 TFLOPS double-precision, respectively.
NVIDIA Tesla K10 GPU Accelerator
In today’s “Inside Kepler” session Lars Nyland, from NVIDIA’s architecture group, and Stephen Jones, from the NVIDIA CUDA group, dove into the improved architecture and programmability of the GK110 GPU.

Continue reading »

Apr 13, 2012
 

Intel has once again done an excellent job designing a high-performance processor. The new Xeon E5-2600 “Sandy Bridge EP” processors run as much as 2.2 times faster than the previous-generation Xeon 5600 “Westmere” processors. Combined with new Xeon server/workstation platforms, they will be extremely attractive to anyone with computationally-intensive needs.

The new Intel architecture provides many benefits right out of the box, while others may require changes on your end. Read on to make sure you’re achieving the best performance.
Continue reading »

Jan 13, 2012
 

I think everyone in the HPC arena has heard plenty about GPUs. GPUs aren’t sophisticated like CPUs, but they provide raw performance for those who know how to use them. The question for those who have large computational workloads has been: Do I have the time, energy and know-how to take advantage of GPUs?

With the 2x in 4 Weeks promotion, NVIDIA and PGI are hoping to demonstrate that almost anyone can succeed. The new OpenACC directives standard allows the compiler to accelerate code without a complete re-write or digging into CUDA. Your application can be running twice as fast (or more) with less than a month of work. And if you register with Microway, you can easily use our 8-GPU Tesla SimCluster to recompile your application for GPUs and see the speedup! This post will walk you through the process.
Continue reading »

Dec 05, 2011
 

Most users know how to check the status of their CPUs, see how much memory is free or find out how much disk space is free. In contrast, keeping tabs on the health and status of GPUs has historically been more difficult. If you don’t know where to look, it can even be difficult to determine the type and capabilities of the GPUs in a system. Thankfully, NVIDIA’s latest hardware and software tools have made good improvements in this respect.
Continue reading »