Tuesday, September 24, 2013

Another Glenn Analysis: Xeon Phi Adoption

Oh, dear. Glenn Lockwood's at it again: analyzing public data and drawing conclusions from it. Someone might get the wrong idea and accuse him of being a scientist. In this case, he's looking at usage of the Xeon Phi Coprocessor on TACC's Stampede system, and how promotional hype might influence others to purchase coprocessors, accelerators, etc., before clear levels of adoption and usage patterns have been shown.

Humor (facetious or not) aside, determining the optimal system for a campus (university, etc.) to deploy should not be driven solely by hype, and in my experience, it isn't. In particular, the UCSD condo computing system TSCC (cf. similar programs at LBNL, UCLA, and Rice) includes both standard (CPU-only) and GPU-enabled (i.e., standard compute + GPU) nodes. The important lesson is that the number of GPU (accelerated) nodes is determined by participant buy-in. Principal investigators have chosen how many CPU-only and CPU+GPU nodes to put into the cluster based on their groups' research needs.

For this condo computing case I think the situation is well in hand, because it's market-driven. Where Glenn's legitimate concern arises is when someone needs to predict adoption from the available data, which may only be a survey of current systems. That is akin to designing a car based on what's presently sold. Seriously--nothing about another center's cluster tells you what your needs are. I support (that is, my group administers) CPU-only clusters, fully GPU-enabled clusters, and mixed clusters. My observation is that adoption rates are wholly application dependent. I'm convinced that the CPU-only and fully GPU-enabled clusters we administer that are running at capacity do so because they were targeted at the right applications.

Glenn hints at the experience we've had with Gordon at SDSC in getting research groups to use flash storage. The clear lesson is that some researchers can take advantage of new technologies almost immediately (MPI for domain-decomposed physics simulations, GPUs for molecular dynamics), while the rest of the scientific community enters a "try before you buy" phase. This is where the benefits of national systems come in.

Machines like Gordon, Stampede, or Keeneland allow researchers to test new technologies with less risk. Rather than each group purchasing dedicated hardware for internal use as part of a grant, the NSF makes these systems available through XSEDE for everyone to use. The allocation process takes roughly the same effort as writing a grant, but the capital investment is greatly reduced, and the time commitment (typically six months to a year) is far shorter than the lifetime of an idle piece of equipment sitting in an office. (Forgive me; I gained my experience as a system engineer maintaining a cluster as a grad student. The end result was positive, but I still chafe at the inefficiency.)

So, I agree with Glenn's note of caution to universities deciding what kind of cluster to buy: don't decide based on what's currently in production at major supercomputing centers. Instead, develop a business model that lets you bring in what your campus researchers actually need. In any funding climate, you will want to maximize your return on investment, and money spent on unused equipment could have been leveraged for so many other opportunities.
