Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors, 2nd Edition [508 Pages]


Handbook on the Development and Optimization of Parallel Applications for Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors

US $29 [PDF]

  • An example-based intensive guide for programming Intel® Xeon Phi™ coprocessors
  • Introduction to task- and data-parallel programming with MPI, OpenMP, Intel Cilk Plus, and automatic vectorization with the Intel C++ compiler
  • Extensive discussion of high performance computing (HPC) application optimization on the Intel® Xeon® and Intel® Xeon Phi™ platforms, including scalar optimizations, improvement of SIMD operations, multithreading, efficient cache utilization, and scaling across heterogeneous distributed-memory computing platforms
  • Supplements Colfax in-class or self-paced training with dedicated access to a computing system with Intel® Xeon Phi™ coprocessors (Colfax Developer Training)


Publication Date: May 2015 | ISBN 978-0-9885234-2-5 (electronic), ISBN 978-0-9885234-0-1 (print) | Edition: 2

This book will guide you to the mastery of parallel programming with Intel® Xeon® family products: Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. It includes a detailed presentation of the programming paradigm for the Intel® Xeon® product family, optimization guidelines, and hands-on exercises on systems equipped with Intel® Xeon Phi™ coprocessors, as well as instructions on using the Intel software development tools and libraries included in Intel Parallel Studio XE.

This book is targeted toward developers familiar with C/C++ programming in Linux. Developers with little parallel programming experience will be able to grasp the core concepts of these subjects from the detailed commentary in Chapter 3. For advanced developers familiar with multi-core and/or GPU programming, the ebook offers materials specific to Intel compilers and Intel® Xeon® family products, as well as optimization advice pertinent to Many Integrated Core (MIC) architecture.

We have written these materials relying on two key elements of efficient learning: practice and repetition. As a consequence, the reader will find a great number of code listings in the main section of these materials. In the extended appendix, we provide numerous hands-on exercises that one can complete either under an instructor’s supervision or autonomously in a self-paced training environment.

This document is different from a typical book on computer science, because we intended it to be used as a lecture plan in an intensive learning course. Speaking in programming terms, a typical book traverses material with a “depth-first algorithm”, describing every detail of each method or concept before moving on to the next method. In contrast, this document traverses the scope of materials with a “breadth-first” algorithm. First, we give an overview of multiple methods to address a certain issue. In the subsequent chapter, we re-visit these methods, this time in greater detail. We may go into even more depth down the line. In this way, we expect that developers will have enough time to absorb and comprehend the variety of programming and optimization methods presented here.

About the Authors

Andrey Vladimirov, PhD, is Head of HPC Research at Colfax International. His primary interest is the application of modern computing technologies to computationally demanding scientific problems. Prior to joining Colfax, A. Vladimirov was involved in computational astrophysics research at Stanford University, North Carolina State University, and the Ioffe Institute in Russia, where he studied cosmic rays, collisionless plasmas and the interstellar medium using computer simulations.

Ryo Asai is a Researcher at Colfax International. He develops optimization methods for scientific applications targeting emerging parallel computing platforms, computing accelerators and interconnect technologies. Ryo holds a B.S. degree in Physics from the University of California, Berkeley.

Vadim Karpusenko, PhD, is Principal HPC Research Engineer at Colfax International, involved in training and consultancy projects on data mining, software development and statistical analysis of complex systems. His research interests are in the area of physical modeling with HPC clusters, highly parallel architectures, and code optimization. Vadim holds a PhD from North Carolina State University for his research in the field of computational biophysics on the free energy and stability of helical secondary structures of proteins.

What's New in the Second Edition?

The second edition is a major revision of the book. New features include:

  • Revised practical exercises tuned for the behavior of the latest software tools in Intel Parallel Studio XE 2015
  • Obsolete information on MPSS 2.x replaced with current information for MPSS 3.x
  • 40% of the exercises are new, with the updates targeted at efficient learning
  • All exercises are revised for improved workflow (instructions located next to source code) and user experience (standardized performance reporting)
  • New topics discussed in the text: networking in clusters with coprocessors, upcoming second generation of coprocessors, additional optimization topics
  • Improved layout: large fonts optimized for reading the PDF file on a computer screen
  • Significant changes in the text based on reader feedback improve clarity and flow

Overall, if you own the 1st edition of the book, it remains a valid introduction to parallel programming with Intel Xeon Phi coprocessors. However, if you need up-to-date practical programming and optimization recipes or classroom material, it is worth upgrading to the 2nd edition.

Writing a print book about any parallel computing topic is daunting for several reasons. As the author of 23 technical books, I can attest to the fact that technology can become out of date before the book is published. This is the most difficult task in writing a print book—keeping things up to date for publication. The second difficulty is covering the enormous amount of information in such a way that it fits into a print book, yet has enough depth to provide usable information. The third difficulty is addressing a wide audience so that everyone gets the value of the information, whether parallel programming newcomer or veteran. In my opinion, the authors of "Parallel Programming and Optimization With Intel Xeon Phi Coprocessors, 2nd Edition" have done a phenomenal job on all three counts. The book is current, provides information that is directly applicable, and can be effectively read by a wide range of programmers.

The authors, Andrey Vladimirov, Ryo Asai and Vadim Karpusenko, start off talking about the Xeon and Xeon Phi technology, and how it differs from previous multicore processors. They also make a bold claim that the book takes a platform agnostic approach, and presents concepts in a portable manner. I did find this to be true throughout the book, so regardless of the operating system, the concepts presented apply. Chapter 3 is processor agnostic, so applicable even for pre-Xeon processors. Even though much of the information is applicable across the board, they do point out ways to optimize for the Xeon Phi if that is your target.

Native and Offload Models

The second chapter is gold. It provides a clear explanation of the two prevalent programming models for the Xeon Phi. They are the offload and native models. Native programming allows an executable program to be transferred to a coprocessor. Once transferred, the executable can run without the involvement of the host. This is made possible in large part because the Xeon Phi runs a Linux operating system along with a virtual file system and multi-user environment. Fortunately, Intel has taken care of all of the details by adding options to their compilers that facilitate building as a native executable. In contrast to what is normally called CPU parallelization, the entire native executable is run on the Xeon Phi rather than portions being doled out to processor cores.
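
As a rough illustration of the native model (a sketch, not a listing from the book), the code below assumes the Intel C++ compiler, whose `-mmic` option builds an executable for the coprocessor and which defines the `__MIC__` macro in that build. The function name `build_target` is hypothetical.

```cpp
#include <string>

// Reports which platform this code was compiled for.
// Assumption: the Intel compiler defines __MIC__ when building a native
// coprocessor executable (for example with -mmic). Any other build,
// including one made with a different compiler, takes the host branch.
std::string build_target() {
#ifdef __MIC__
    return "coprocessor";
#else
    return "host";
#endif
}
```

A program that prints the result of `build_target()` would report `coprocessor` when built natively and started on the coprocessor, and `host` in a regular build, with no source changes between the two.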

The offload model resembles the native model, but has differences. When using the offload model, the executable begins execution on the host. At any point, though, some sections of code and data can be offloaded to the coprocessor, and executed there. As with the native model, Intel has made the offload model easy and straightforward. A set of pragma statements can be used to offload code portions. For instance, the following example offloads a small amount of code from a program that is otherwise running on the host.
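
The reviewer's listing is not reproduced on this page. A minimal sketch of what such an offload could look like is given here, assuming the Intel compiler's `#pragma offload target(mic)` language extension; the function name `offload_sum` and its parameters are hypothetical.

```cpp
// Sums n floats from data.
// Assumption: built with the Intel compiler on a system with a coprocessor,
// the block under the pragma runs on the coprocessor. The in() clause copies
// n elements of data to the device, and inout() copies sum there and back.
// Compilers without offload support ignore the pragma, so the same loop
// simply runs on the host.
float offload_sum(const float* data, int n) {
    float sum = 0.0f;
#pragma offload target(mic) in(data : length(n)) inout(sum)
    {
        for (int i = 0; i < n; ++i) {
            sum += data[i];
        }
    }
    return sum;
}
```

Everything outside the braces stays on the host, which is the point of the model: only the marked region and the data it names are shipped to the coprocessor.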

Parallel Programming Paradigms

While Chapter 3 does not present anything that is new, it is an important chapter for the book since it provides a foundation of parallelization using programming language extensions. This includes vectorization, OpenMP, Cilk, and MPI. All four of these are cornerstones of modern parallelization, and an essential element of this book. As I pointed out earlier, this book makes an effort to address a wide audience. Inclusion of this material means that newcomers to parallelization will not have to seek other references in order to learn the basics of, say, a parallel for loop.

The material is not only important, but it is well-written and fairly complete. This concisely written chapter will be what I reach for first when I need to double check OpenMP syntax. What I especially appreciate is that the examples are all simple and to the point. Many times authors throw in several ideas, and it can be hard to separate them. Here, each idea is demonstrated on its own and crystal clear.
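
For readers who have not seen one, a parallel for loop in OpenMP can be sketched as follows. This is an illustrative example rather than a listing from the book; the function name `dot` is hypothetical, and both vectors are assumed to have the same length.

```cpp
#include <vector>

// Dot product of two equally sized vectors using an OpenMP parallel for loop.
// The reduction clause gives every thread a private partial sum and adds the
// partial sums together when the loop finishes.
// If OpenMP is not enabled in the compiler, the pragma is ignored and the
// loop runs serially.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    const long n = static_cast<long>(a.size());
#pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Without the `reduction` clause, all threads would update `sum` at the same time and the result would be unreliable.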

Optimizing Parallel Applications

Chapter 4 is so full of meat that it will take me months to digest. It tackles optimization of parallel applications, especially with regard to Xeon and Xeon Phi processors. I really appreciated the optimization checklist. Developers often think they have exhaustively optimized code, only to discover later on that they missed an opportunity. With this checklist, you can make sure that you have considered everything.

There is a lot in this chapter, but for the type of development I do, the most useful material was the section on optimization of transcendental functions. I often need to crunch large sets of data by performing math on each element. For instance, calculating the standard deviation of a data set requires squaring numbers and also finding square roots. Taking advantage of the Xeon Phi technology can provide a 2x or even 3x advantage over using the normal math libraries. For any application that crunches numbers, this can make the difference if you heed the advice of the authors.
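
The standard deviation calculation mentioned above can be sketched as a pair of element-wise loops of the kind that vectorize well. This is an illustrative example under stated assumptions, not the book's code: the function name `stddev` is hypothetical, it computes the population standard deviation, and the `#pragma omp simd` hints take effect only when the compiler has OpenMP SIMD support enabled.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Population standard deviation of x (returns 0 for an empty input).
// Each loop is a simple reduction over the data. The simd pragmas tell the
// compiler it may vectorize the reductions; compilers without OpenMP SIMD
// support ignore them and run the loops as ordinary scalar code.
double stddev(const std::vector<double>& x) {
    const std::size_t n = x.size();
    if (n == 0) {
        return 0.0;
    }
    double total = 0.0;
#pragma omp simd reduction(+ : total)
    for (std::size_t i = 0; i < n; ++i) {
        total += x[i];
    }
    const double mean = total / static_cast<double>(n);

    double sum_sq = 0.0;
#pragma omp simd reduction(+ : sum_sq)
    for (std::size_t i = 0; i < n; ++i) {
        const double d = x[i] - mean;
        sum_sq += d * d;  // squaring each deviation
    }
    return std::sqrt(sum_sq / static_cast<double>(n));  // one square root
}
```

Any speedup from vectorizing loops like these depends on the data size, the compiler, and the hardware, so it is worth measuring rather than assuming.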

Software Development Tools

When the rubber hits the road, you need to find the tools that will deliver on the Xeon Phi promises. Chapter 5 is a roadmap to the tools. As I read the chapter, I realized how much time it was saving me. Instead of leaving me to sift through dozens of online articles to find the answers, the chapter summarizes them. Most of this chapter revolves around Intel’s Parallel Studio. The authors clarify how to get the most out of this product.


If you have any aspirations of taking advantage of Xeon and Xeon Phi processors, this book is a must-have. If you just want a concise overview of parallelization, this book is also a must-have. You won’t read and master the material in a week. But I plan to work through the entire book, using it to hone my skills before most other developers do, which will give me a distinct advantage.

- Rick Leinecker, Contributing Editor, Slashdot Media