750 Palomar Ave, Sunnyvale, CA 94085 | 408-730-2275

White Papers


Machine Learning on Intel® Xeon Phi™ Processors: Image Captioning with NeuralTalk2, Torch

Posted: June 20, 2016

In this case study, we describe a proof-of-concept implementation of a highly optimized machine learning application for Intel Architecture. Our results demonstrate the capabilities of Intel Architecture, particularly Intel Xeon Phi processors (formerly codenamed Knights Landing) in the machine learning domain.

Click here to learn more


3 Papers to Help you get up to Speed with Knights Landing - Automatic Vectorization with Intel AVX-512 Instructions, Clustering Modes and MCDRAM as High-Bandwidth Memory (HBM)

Posted: May 11, 2016

The 3 papers discuss the new features in Knights Landing, practical usage tips, and guidelines for determining the optimal usage model for applications migrating to the bootable Knights Landing platform:

  • Automatic Vectorization with Intel AVX-512 Instructions
  • Clustering Modes
  • MCDRAM as High-Bandwidth Memory (HBM)

Click here to learn more


Introduction to Intel® Data Analytics Acceleration Library (Intel® DAAL), Part 2: Distributed Variance-Covariance Matrix Computation

Posted: March 28, 2016

This is part 2 of 3 of an introductory series of publications on the Intel Data Analytics Acceleration Library (DAAL). DAAL is a data analytics library optimized for modern highly parallel computer architectures such as Intel Xeon and Intel Xeon Phi processors. The goal of this series is to provide developers with a technical overview of developing applications using DAAL.

In part 1 of the series we discussed how to implement batch mode computation on a single node.

In the present publication, we discuss the distributed mode computation. Our discussion will focus both on how and when to implement distributed mode computation with Intel DAAL.

As an example workload, we implement an application that uses DAAL to compute a covariance matrix of a set of vectors. We first demonstrate how to use distributed mode with this example. Then, using this example application, we scan the parameter space to determine what parameter ranges benefit from distributed computation.
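As a rough illustration of why a covariance matrix lends itself to distributed computation, the sketch below has each node reduce its block of vectors to a small partial result (count, sums, and sums of products), which a master step then adds up and finalizes. This is not the DAAL API: the function names `partial_result` and `merge_and_finalize` are hypothetical, and the one-pass formula used here is the simplest one rather than the most numerically robust.

```python
def partial_result(block):
    """Per-node step: count, column sums, and sums of pairwise products."""
    dim = len(block[0])
    sums = [0.0] * dim
    cross = [[0.0] * dim for _ in range(dim)]
    for row in block:
        for i in range(dim):
            sums[i] += row[i]
            for j in range(dim):
                cross[i][j] += row[i] * row[j]
    return len(block), sums, cross


def merge_and_finalize(partials):
    """Master step: add the partial results, then form the sample covariance."""
    dim = len(partials[0][1])
    n = 0
    sums = [0.0] * dim
    cross = [[0.0] * dim for _ in range(dim)]
    for count, s, c in partials:
        n += count
        for i in range(dim):
            sums[i] += s[i]
            for j in range(dim):
                cross[i][j] += c[i][j]
    # cov[i][j] = (sum(x_i * x_j) - sum(x_i) * sum(x_j) / n) / (n - 1)
    return [[(cross[i][j] - sums[i] * sums[j] / n) / (n - 1)
             for j in range(dim)] for i in range(dim)]


# Hypothetical data split across two "nodes"
node_a = [[1.0, 2.0], [2.0, 4.0]]
node_b = [[3.0, 6.0], [4.0, 8.0]]
cov = merge_and_finalize([partial_result(node_a), partial_result(node_b)])
```

Only the small partial results travel between nodes, so the communication cost depends on the vector dimension and not on the number of vectors held by each node.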

We also demonstrate how the output of this computation may be used in image processing to compute the eigenvectors of a set of images. The source code for this application is available for free download.

In the upcoming 3rd part of the series, we will discuss the online computation mode, using an example workload with multiple datasets and interfacing with a relational database via SQL.

Click here to learn more


Introduction to Intel® Data Analytics Acceleration Library (Intel® DAAL), Part 1: Polynomial Regression with Batch Mode Computation

Posted: October 28, 2015

This is part 1 of 3 of an introductory series of publications on the Intel Data Analytics Acceleration Library (DAAL). DAAL is a data analytics library optimized for modern highly parallel computer architectures such as Intel Xeon and Intel Xeon Phi processors. The goal of this series is to provide developers with a technical overview of developing applications using DAAL.

In this paper we focus on two aspects of developing an application with Intel DAAL: data management and computation. As a practical example, we implement a simple machine learning application with polynomial regression using the library in the batch computation mode. We demonstrate using this application for data-based prediction of hydrodynamics properties of yachts. The source code and data for the sample application are available for free download.

The second and third parts of the series will discuss other aspects of data analysis with DAAL. In part 2, we discuss distributed data and computation in conjunction with MPI. In part 3, we discuss the case with multiple data sets and interfacing with a relational database using SQL.

Click here to learn more


Optimization Techniques for the Intel MIC Architecture. Part 3 of 3: False Sharing and Padding

Posted: August 08, 2015

This is part 3 of a 3-part educational series of publications introducing select topics on optimization of applications for Intel’s multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors).

In this paper we discuss false sharing, highlighting the situations in which it may occur, and eliminating it with the help of data container padding.

For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. Similar workloads occur in Monte Carlo simulations, particle physics software, and statistical analysis.

Results show that the impact of false sharing may be as high as an order of magnitude performance loss in a parallel application. On Intel Xeon processors, padding required to eliminate false sharing is greater than on Intel Xeon Phi coprocessors, so target-specific padding values may be used in real-life applications.

Click here to learn more


Software Developer’s Introduction to the HGST Ultrastar Archive Ha10 SMR Drives

Posted: July 31, 2015

In this paper we will discuss the new HGST Shingled Magnetic Recording (SMR) drives, Ultrastar Archive Ha10, which offer storage capacities of 10 TB and beyond. With their high-density storage capacities, these drives are well suited for large “active archive” applications. In an active archive application, the data is frequently read but seldom modified.

The SMR drives are host managed, meaning that the developer must manage the data storage on the drives. In this publication we introduce an open source library, libzbc, which was developed by the HGST team to assist developers who use SMR drives. The discussions cover topics from the very basics like opening a device, to more advanced topics like data padding. The goal of this paper is to give readers the necessary knowledge and tools to develop applications with libzbc.

We will present an example, and then report several benchmarks of I/O operations on the HGST SMR drives, and discuss the SMR drive’s effectiveness as an active archive solution.

Click here to learn more


Optimization Techniques for the Intel MIC Architecture. Part 2 of 3: Strip-Mining for Vectorization

Posted: June 26, 2015

This is part 2 of a 3-part educational series of publications introducing select topics on optimization of applications for Intel’s multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors).

In this paper we discuss data parallelism. Our focus is automatic vectorization and exposing vectorization opportunities to the compiler.

For a practical illustration, we construct and optimize a micro-kernel for binning particles. Similar workloads occur in Monte Carlo simulations, particle physics software, and statistical analysis.

The optimization technique discussed in this paper leads to code vectorization, which results in an order of magnitude performance improvement on an Intel Xeon processor. Performance on Xeon Phi compared to that on a high-end Xeon is 1.4x greater in single precision and 1.6x greater in double precision.
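The loop structure behind strip-mining can be sketched as follows, assuming a one-dimensional binning problem. The loop over particles is split into an outer loop over strips and two short inner loops: one that only computes bin indices (the part a compiler could vectorize in C) and one that performs the scattered increments. The strip width `STRIP` and the function name `bin_particles` are hypothetical, and Python is used here only to show the shape of the loops, not to gain performance.

```python
STRIP = 16  # hypothetical strip width; in compiled code, a multiple of the vector length


def bin_particles(xs, n_bins, x_max):
    """Count particles per bin, assuming 0 <= x < x_max for every particle."""
    counts = [0] * n_bins
    scale = n_bins / x_max
    n = len(xs)
    for start in range(0, n, STRIP):           # outer loop over strips
        stop = min(start + STRIP, n)           # last strip may be shorter
        # Stage 1: index computation only, no dependence between iterations
        idx = [int(x * scale) for x in xs[start:stop]]
        # Stage 2: scattered increments, kept separate from the arithmetic
        for k in idx:
            counts[k] += 1
    return counts
```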

In part 3 we will revisit thread parallelism and experience a close (and victorious) encounter with another enemy of performance: false sharing. Stay tuned!

Click here to learn more


Optimization Techniques for the Intel MIC Architecture. Part 1 of 3: Multi-Threading and Parallel Reduction

Posted: May 29, 2015

This is part 1 of a 3-part educational series of publications introducing select topics on optimization of applications for the Intel multi-core and manycore architectures (Intel Xeon processors and Intel Xeon Phi coprocessors).

In this paper we focus on thread parallelism and race conditions. We discuss the usage of mutexes in OpenMP to resolve race conditions. We also show how to implement efficient parallel reduction using thread-private storage and mutexes. For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. Workloads like this one occur in such applications as Monte Carlo simulations, particle physics software, and statistical analysis.
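The reduction pattern described above can be sketched like this: each thread accumulates into its own private histogram without any locking, and a mutex protects only the short final merge into the shared result. The paper works with OpenMP; this sketch uses Python's `threading` module purely to show the pattern, and the name `parallel_histogram` is hypothetical. Because of CPython's global interpreter lock, no speedup should be expected from this version.

```python
import threading


def parallel_histogram(xs, n_bins, x_max, n_threads=4):
    """Bin values (0 <= x < x_max) using thread-private counters and one mutex."""
    total = [0] * n_bins
    lock = threading.Lock()
    scale = n_bins / x_max

    def worker(tid):
        private = [0] * n_bins                 # thread-private storage: no races here
        for x in xs[tid::n_threads]:           # each thread takes every n_threads-th item
            private[int(x * scale)] += 1
        with lock:                             # mutex held only for the brief merge
            for b in range(n_bins):
                total[b] += private[b]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

Locking every single increment of the shared histogram would also be correct, but it would serialize the threads; the private copies keep the mutex off the hot path.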

The optimization technique discussed in this paper leads to a performance increase of 25x on a 24-core CPU and up to 100x on the MIC architecture compared to a single-threaded implementation on the same architectures.

In the next publication of this series, we will demonstrate further optimization of this workload, focusing on vectorization. Stay tuned!

Click here to learn more


Fine-Tuning Vectorization and Memory Traffic on Intel® Xeon Phi™ Coprocessors: LU Decomposition of Small Matrices

Posted: January 27, 2015

Common techniques for fine-tuning the performance of automatically vectorized loops in applications for Intel Xeon Phi coprocessors are discussed. These techniques include strength reduction, regularizing the vectorization pattern, data alignment and aligned data hint, and pointer disambiguation. In addition, the loop tiling technique of memory traffic tuning is shown. The optimization methods are illustrated on an example of single-threaded LU decomposition of a single precision matrix of size 128x128.

Benchmarks show that the discussed optimizations improve the performance on the coprocessor by a factor of 2.8 compared to the unoptimized code, and by a factor of 1.7 on the multi-core host system, achieving roughly the same performance on the host and on the coprocessor.
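For orientation, a minimal LU decomposition without pivoting is sketched below, together with one plausible instance of strength reduction: the division by the pivot is replaced by a single reciprocal per column followed by multiplications. This is an assumption about what such a kernel might look like, not the code from the paper, and the name `lu_in_place` is hypothetical.

```python
def lu_in_place(a):
    """Doolittle LU without pivoting; L (unit diagonal) and U overwrite a."""
    n = len(a)
    for k in range(n):
        inv_pivot = 1.0 / a[k][k]              # one division per column...
        row_k = a[k]
        for i in range(k + 1, n):
            row_i = a[i]
            row_i[k] *= inv_pivot              # ...then only multiplications
            factor = row_i[k]
            for j in range(k + 1, n):          # unit-stride inner loop over a row
                row_i[j] -= factor * row_k[j]
    return a


# Small example: the strict lower triangle holds L, the rest holds U
factors = lu_in_place([[4.0, 3.0], [6.0, 3.0]])
```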

Click here to learn more


Performance to Power and Performance to Cost Ratios with Intel® Xeon Phi™ Coprocessors (and why 1x Acceleration May Be Enough)

Posted: January 27, 2015

The paper studies two performance metrics of systems enabled with Intel Xeon Phi coprocessors: the ratio of performance to consumed electrical power and the ratio of performance to purchasing system cost, both under the assumption of linear parallel scalability of the application.

Performance to power values are measured for three workloads: a compute-bound workload (DGEMM), a memory bandwidth-bound workload (STREAM), and a latency-limited workload (small matrix LU decomposition). Performance to cost ratios are computed, using system configurations and prices available at Colfax International, as functions of the acceleration factor and of the number of coprocessors per system. That study considers hypothetical applications with acceleration factors from 0.35x to 2x.

In all studies, systems with Intel Xeon Phi coprocessors yield better metrics than systems with only Intel Xeon processors. That applies even with an acceleration factor of 1x, as long as the application can be distributed between the CPU and the coprocessor.
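A simple way to see why an acceleration factor of 1x can still pay off is the sketch below. It assumes a model in which the host CPUs contribute one unit of performance, each coprocessor contributes `accel` units, and the work scales linearly across all devices. The model, the function name `perf_per_cost_gain`, and the prices in the example are illustrative assumptions, not figures from the paper.

```python
def perf_per_cost_gain(accel, n_cards, card_cost, base_cost):
    """Performance-to-cost ratio of an accelerated system relative to CPU-only.

    accel:     performance of one coprocessor relative to the host CPUs
    n_cards:   number of coprocessors in the system
    card_cost: price of one coprocessor (hypothetical)
    base_cost: price of the CPU-only system (hypothetical)
    """
    perf = 1.0 + n_cards * accel               # linear scaling assumed
    cost = base_cost + n_cards * card_cost
    return (perf / cost) / (1.0 / base_cost)


# Hypothetical prices: one card at 30% of the base system price, accel = 1x
gain = perf_per_cost_gain(accel=1.0, n_cards=1, card_cost=3000.0, base_cost=10000.0)
```

Under these assumptions the gain exceeds 1 whenever `accel` is larger than `card_cost / base_cost`, because a coprocessor adds performance without duplicating the cost of the rest of the system.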

Click here to learn more


Intel Cilk Plus for Complex Parallel Algorithms: "Enormous Fast Fourier Transforms" (EFFT) Library

Posted: September 18, 2014

In this paper we demonstrate the methodology for parallelizing the computation of large one-dimensional discrete fast Fourier transforms (DFFTs) on multi-core Intel Xeon processors. DFFTs based on the recursive Cooley-Tukey method have to control cache utilization, memory bandwidth and vector hardware usage, and at the same time scale across multiple threads or compute nodes. Our method builds on single-threaded Intel Math Kernel Library (MKL) implementation of DFFT, and uses the Intel Cilk Plus framework for thread parallelism. We demonstrate the ability of Intel Cilk Plus to handle parallel recursion with nested loop-centric parallelism without tuning the code to the number of cores or cache metrics. The result of our work is a library called EFFT that performs 1D DFTs of size 2^N for N>=21 faster than the corresponding Intel MKL parallel DFT implementation by up to 1.5x, and faster than FFTW by up to 2.5x. The code of EFFT is available for free download under the GPLv3 license. This work provides a new efficient DFFT implementation, and at the same time demonstrates an educational example of how computer science problems with complex parallel patterns can be optimized for high performance using the Intel Cilk Plus framework.
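For readers unfamiliar with the recursive Cooley-Tukey method, a textbook radix-2 version is sketched below. It is not the EFFT code, which builds on Intel MKL and Intel Cilk Plus; the function name `fft` is hypothetical and the input length is assumed to be a power of two.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey DFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])                        # the two half-size transforms are
    odd = fft(x[1::2])                         # independent of each other
    out = [0j] * n
    for k in range(n // 2):                    # butterfly loop combining the halves
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

The two recursive calls do not depend on each other and the butterfly loop has independent iterations, which is the combination of parallel recursion and loop-centric parallelism that the abstract refers to.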

Click here to learn more


Installing Intel MPSS 3.3 in Arch Linux

Posted: August 20, 2014

This technical publication provides instructions for installing the Intel Manycore Platform Software Stack (MPSS) version 3.3 in the Arch Linux operating system. Intel MPSS is a suite of tools necessary for operation of Intel Xeon Phi coprocessors. Instructions provided here enable offload and networking functionality for coprocessors in Arch Linux. The procedure described in this paper is completely reversible via an uninstallation script.


File I/O on Intel Xeon Phi Coprocessors: RAM disks, VirtIO, NFS and Lustre

Posted: July 28, 2014

A unique portability feature of Intel Xeon Phi coprocessors is that applications can write and read local and remote files directly from the accelerator card as if it were just another compute node in the cluster. This allows programmers to take existing applications written for CPU-based clusters and port them to accelerators with only recompilation. All threading frameworks, communication protocols and file I/O facilities will work on the accelerator as long as they are properly configured.

The paper provides more details on I/O support in Intel Xeon Phi coprocessors:

  • Summarizes file storage systems accessible from Xeon Phi: RAM disks, local and distributed disk-based storage (including the scalable high-performance Lustre file system)
  • Provides benchmarks and discussion that help the system administrator to pick and configure the best storage option for the task at hand
  • Studies the parallel scalability of file I/O, assisting the programmer in the decisions on the optimization of parallelism in the application

Click here to learn more


Cluster-Level Tuning of a Shallow Water Equation Solver on the Intel MIC Architecture

Posted: May 12, 2014

The paper demonstrates the process of porting a computational fluid dynamics (CFD) application to a cluster enabled with Intel Xeon Phi coprocessors and interconnected by InfiniBand links.

  • The application solves equations of shallow water flow, which is a CFD problem important for weather and climate modeling
  • Only one line of legacy Fortran code had to be modified in order to achieve scalability across multiple Intel Xeon Phi coprocessors, and the hybrid OpenMP/MPI runtime environment needed to be tuned in order to efficiently utilize the MIC architecture
  • Each Intel Xeon Phi coprocessor contributes 1.6x more performance than a top-of-the-line dual-socket multi-core CPU, and with 4 coprocessors per compute node, the application runs 5.6x faster on the MIC architecture than on CPUs
  • Methods discussed in the paper are applicable to other memory bandwidth-bound stencil codes for distributed-memory systems

This work is a collaboration between the researchers of Colfax International, USA and University of Liverpool, UK.

Click here to learn more


Configuration and Benchmarks of Peer-to-Peer Communication over Gigabit Ethernet and InfiniBand in a Cluster with Intel Xeon Phi Coprocessors

Posted: March 11, 2014

Intel Xeon Phi coprocessors allow symmetric heterogeneous clustering models, in which MPI processes are run fully on coprocessors, as opposed to offload-based clustering. These symmetric models are attractive, because they allow effortless porting of CPU-based applications to clusters with manycore computing accelerators.

However, with the default software configuration and without specialized networking hardware, peer-to-peer communication between coprocessors in a cluster is quenched by orders of magnitude compared to the capabilities of Gigabit Ethernet networking hardware. This situation is remedied by InfiniBand interconnects and the software supporting them.

In this paper we demonstrate the procedures for configuring a cluster with Intel Xeon Phi coprocessors connected with Gigabit Ethernet as well as InfiniBand interconnects. We measure and discuss the latencies and bandwidths of MPI messages with and without the advanced configuration with InfiniBand support. The paper contains a discussion of MPI application tuning in an InfiniBand-enabled cluster with Intel Xeon Phi Coprocessors, a case study of the impact of InfiniBand protocol, and a set of recommendations for accommodating the non-uniform RDMA performance across the PCIe bus in high performance computing applications.

Click here to learn more


"Heterochromic" Computer and Finding the Optimal System Configuration for Medical Device Engineering

Posted: January 27, 2014

Designing a computing system configuration for optimal performance of a given task is always challenging, especially if the acquisition budget is fixed. It is difficult, if not impossible, to analytically resolve all of the following questions:

  • How well does the application scale across multiple cores?
  • What is the efficiency and scalability of the application with accelerators (GPGPUs or coprocessors)?
  • Should measures be taken to prevent I/O bottlenecks?
  • Is it more efficient to scale up a single task or partition the system for multiple tasks?
  • What combination of CPU models, accelerator count, and per-core software licenses gives the best return on investment?

Rigorous benchmarking is the most reliable method of ensuring the "best bang for buck"; however, it requires access to the computing systems of interest. Colfax takes pride in being able to offer interested customers opportunities for deducing the optimal configuration for specific tasks.

Recently we received a request from Peter Newman, Systems Engineer at Carestream Health, for evaluating the performance of the software tool ANSYS Mechanical on Colfax's computing solutions. His goal was to find the optimum number of computing accelerators (if any) and software licenses that he needed to purchase in order to achieve the best performance of specific calculations in ANSYS.

In order to allow Mr. Newman to seamlessly benchmark a variety of system configurations, we provided him access to a unique machine built by Colfax, based on an Intel Xeon E5 CPU, and supporting four Nvidia Tesla K40 GPGPUs and four Intel Xeon Phi 7120P coprocessors. Normally, this system is built either with eight GPGPUs as CXT9000, or outfitted with eight Xeon Phi coprocessors as CXP9000. However, the "heterochromic" (i.e., featuring both Nvidia's and Intel's accelerators) configuration that we produced for this project allowed the customer to benchmark the ANSYS software on both the Nvidia Tesla and Intel Xeon Phi platforms with minimal logistic effort. Indeed, the software had to be installed only once, and the benchmark scripts and data collection scripts could all be retained in one place.

The methodology of the study was developed by Peter Newman, who also executed the benchmarks, collected and analyzed the data, and summarized findings in a comprehensive report. Mr. Jason Zbick of SimuTech Group, an ANSYS distributor, participated in the study and provided support for ANSYS Mechanical installation and configuration. Colfax's involvement included custom system configuration, maintenance of secure remote access to the system and assistance with automated result collection.

Click here to learn more


Parallel Computing In The Search For New Physics At Large Hadron Collider (LHC)

Posted: December 02, 2013

In the past few months Colfax has had the pleasure of collaborating with Prof. Valerie Halyo of Princeton University on modernization of a high energy physics application for the needs of the Large Hadron Collider (LHC). The objective of the project is to improve the performance of the trigger at LHC, so as to enable real-time detection of exotic collision event products, such as black holes or jets.

For the numerical algorithm of the new trigger software, the Hough transform was chosen. This method allows fast detection of straight or curved tracks in a set of points (detector hits), which could be the traces of new exotic particles. The transform is highly parallelizable by nature; however, existing implementations either did not use hardware parallelism or used it sub-optimally.
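As a minimal sketch of the idea, the code below applies the classic Hough transform for straight lines only: every point votes for all (theta, rho) pairs of lines that could pass through it, and collinear points pile their votes into the same accumulator cell. The parameterization rho = x cos(theta) + y sin(theta), the bin sizes, and the name `hough_lines` are assumptions made for illustration and are not taken from the trigger software.

```python
import math
from collections import Counter


def hough_lines(points, n_theta=180, rho_step=1.0):
    """Accumulate votes for lines rho = x*cos(theta) + y*sin(theta)."""
    votes = Counter()
    for t in range(n_theta):                   # iterations over theta are independent
        theta = math.pi * t / n_theta
        c, s = math.cos(theta), math.sin(theta)
        for x, y in points:
            rho = x * c + y * s
            votes[(t, round(rho / rho_step))] += 1
    return votes


# Ten hits on the vertical line x = 5 all vote for the cell theta index 0, rho index 5
hits = [(5.0, float(y)) for y in range(10)]
accumulator = hough_lines(hits)
```

Each value of theta can be processed independently, which is one reason the method maps well onto thread-parallel hardware.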

Colfax's role in the project was to optimize a thread-parallel implementation of the Hough transform for multi-core processors. The result of our involvement was a code capable of detecting 5000 tracks in a synthetic dataset 250x faster than prior art, on a multi-core desktop CPU. By benchmarking the application on a server based on multi-core Intel Xeon E5 processors, we obtained a further 5x increase in performance. The techniques used for optimization, briefly discussed in the report, are featured in our book on parallel programming and in our developer training program. They focus on code portability across multi- and many-core platforms, with the emphasis on future-proofing the optimized application.

Our results are reported in a publication submitted for peer review to JINST. Prof. Halyo's work was also featured in an article in International Journal of Innovation, available for download (courtesy of Prof. Halyo).

Click here to learn more


Accelerating Public Domain Applications: Lessons From Models Of Radiation Transport In The Milky Way Galaxy

Posted: November 25, 2013

Last week Andrey Vladimirov, PhD, Head of HPC Research at Colfax International, had the privilege of giving a talk at the Intel Theater at SC'13. He presented a case study done with Stanford University on using Intel Xeon Phi coprocessors for accelerating a new astrophysical library HEATCODE (HEterogeneous Architecture library for sTochastic COsmic Dust Emissivity).

If this talk can be summarized in one sentence, it would be "One high performance code for two platforms is reality". Indeed, the optimizations performed to tune HEATCODE for the MIC architecture led to a tremendous performance increase on the CPU platform. As a consequence, Colfax has developed a high performance library which can be employed and modified both by users who have access to Xeon Phi coprocessors, and by those only using multi-core CPUs.

The paper introducing HEATCODE library with details of the optimization process is under review at Computer Physics Communications. The preliminary manuscript can be obtained from arXiv, and the slides of the talk are available on this page (see links above and below). The open source code will be made available upon the acceptance of the paper.

Click here to learn more


Heterogeneous Clustering With Homogeneous Code: Accelerate MPI Applications Without Code Surgery Using Intel® Xeon Phi™ Coprocessors

Posted: October 17, 2013

This paper reports on our experience with a heterogeneous cluster execution environment, in which a distributed parallel application utilizes two types of compute devices: those employing general-purpose processors, and those based on computing accelerators known as Intel Xeon Phi coprocessors.

Unlike general-purpose graphics processing units (GPGPUs), Intel Xeon Phi coprocessors are able to execute native applications. In this mode, the application runs in the coprocessor's operating system, and does not require a host process executing on the CPU and offloading data to the accelerator (coprocessor). Therefore, for an application in the MPI framework, it is possible to run MPI processes directly on coprocessors. In this case, coprocessors behave like independent compute nodes in the cluster, with an MPI rank, peer-to-peer communication capability, and access to a network-shared file system. With such a configuration, there is no need to instrument data offload in the application in order to utilize a heterogeneous system comprised of processors and coprocessors. In other words, an MPI application designed for a CPU-only cluster can be used on coprocessor-enabled clusters without code modification.

We discuss the issues of portable code design, load balancing and system configuration (networking and MPI) necessary in order for such a setup to be efficient. An example application used for this study carries out a Monte Carlo simulation for Asian option pricing. The paper includes the performance metrics of this application with CPU-only and heterogeneous cluster configurations.
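To give a sense of the workload, a minimal single-process Monte Carlo pricer is sketched below. It assumes an arithmetic-average Asian call option and geometric Brownian motion for the underlying asset; the abstract does not specify the option type or the model, so these choices, the parameter names, and the function name `asian_call_price` are assumptions.

```python
import math
import random


def asian_call_price(s0, strike, rate, sigma, maturity, n_steps, n_paths, seed=0):
    """Monte Carlo estimate of an arithmetic-average Asian call price."""
    rng = random.Random(seed)
    dt = maturity / n_steps
    drift = (rate - 0.5 * sigma * sigma) * dt
    vol = sigma * math.sqrt(dt)
    payoff_sum = 0.0
    for _ in range(n_paths):                   # paths are independent of each other
        s = s0
        path_sum = 0.0
        for _ in range(n_steps):
            s *= math.exp(drift + vol * rng.gauss(0.0, 1.0))
            path_sum += s
        payoff_sum += max(path_sum / n_steps - strike, 0.0)
    return math.exp(-rate * maturity) * payoff_sum / n_paths


# Hypothetical parameters
price = asian_call_price(100.0, 100.0, 0.05, 0.2, 1.0, n_steps=50, n_paths=2000)
```

Because every path is independent, each MPI rank can simulate its own share of paths with a distinct seed and only the payoff sums need to be combined, which is why the number of paths given to CPUs and to coprocessors becomes a load-balancing parameter.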

Click here to learn more


Multithreaded Transposition of Square Matrices with Common Code for Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors

Posted: August 12, 2013

In-place matrix transposition, a standard operation in linear algebra, is a memory bandwidth-bound operation. The theoretical maximum performance of transposition is the memory copy bandwidth. However, due to non-contiguous memory access in the transposition operation, practical performance is usually lower. The ratio of the transposition rate to the memory copy bandwidth is a measure of the transposition algorithm efficiency.

This paper demonstrates and discusses an efficient C language implementation of parallel in-place square matrix transposition. For large matrices, it achieves a transposition rate of 49 GB/s (82% efficiency) on Intel Xeon CPUs and 113 GB/s (67% efficiency) on Intel Xeon Phi coprocessors. The code is tuned with pragma-based compiler hints and compiler arguments. Thread parallelism in the code is handled by OpenMP, and vectorization is automatically implemented by the Intel compiler. This approach allows the same C code to be used for a CPU and for a MIC architecture executable, both demonstrating high efficiency. For benchmarks, an Intel Xeon Phi 7110P coprocessor is used.

Click here to learn more


How to Write Your Own Blazingly Fast Library of Special Functions for Intel® Xeon Phi™ Coprocessors

Posted: May 03, 2013

Statically-linked libraries are used in business and academia for security, encapsulation, and convenience reasons. Static libraries with functions offloadable to Intel® Xeon Phi™ coprocessors must contain executable code for both the host and the coprocessor architecture. Furthermore, for library functions used in data-parallel contexts, vectorized versions of the functions must be produced at the compilation stage.

This white paper shows how to design and build statically-linked libraries with functions offloadable to Intel® Xeon Phi™ coprocessors. In addition, it illustrates how special functions with scalar syntax (e.g., y=f(x)) can be implemented in such a way that user applications can use them in thread- and data-parallel contexts. The second part of the paper demonstrates some optimization methods that improve the performance of functions with scalar syntax on the multi-core and the many-core platforms: precision control, strength reduction, and algorithmic optimizations.

Click here to learn more


Cache Traffic Optimization on Intel® Xeon Phi™ Coprocessors for Parallel In-Place Square Matrix Transposition with Intel Cilk Plus and OpenMP

Posted: April 25, 2013

Numerical algorithms sensitive to the performance of processor caches can be optimized by increasing the locality of data access. Loop tiling and recursive divide-and-conquer are common methods for cache traffic optimization. This paper studies the applicability of these optimization methods in the Intel® Xeon Phi™ architecture for the in-place square matrix transposition operation. Optimized implementations in the Intel Cilk Plus and OpenMP frameworks are presented and benchmarked. The cache-oblivious nature of the recursive algorithm is compared to the tunable character of the tiled method. Results show that Intel® Xeon Phi™ coprocessors transpose large matrices faster than the host system; however, smaller matrices are more efficiently transposed by the host. On the coprocessor, the Intel Cilk Plus framework excels for large matrix sizes, but incurs a significant parallelization overhead for smaller sizes. Transposition of smaller matrices on the coprocessor is faster with OpenMP.
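The two approaches named above can be contrasted in a short serial sketch. The tiled version has a tunable parameter `tile`, while the recursive version keeps halving the problem until the pieces are small, with no knowledge of the cache size apart from a recursion cutoff. Both functions and their parameter values are hypothetical illustrations, not the implementations benchmarked in the paper.

```python
def transpose_tiled(a, tile=4):
    """In-place transposition of a square matrix, one tile of the lower triangle at a time."""
    n = len(a)
    for ii in range(0, n, tile):
        for jj in range(0, ii + 1, tile):
            for i in range(ii, min(ii + tile, n)):
                # In a diagonal tile, stop at the diagonal so each pair is swapped once
                j_end = min(jj + tile, n) if jj < ii else i
                for j in range(jj, j_end):
                    a[i][j], a[j][i] = a[j][i], a[i][j]
    return a


def _swap_block(a, r0, r1, c0, c1, cutoff):
    """Swap the below-diagonal block a[r0:r1][c0:c1] with its mirror image."""
    if r1 - r0 <= cutoff and c1 - c0 <= cutoff:
        for i in range(r0, r1):
            for j in range(c0, c1):
                a[i][j], a[j][i] = a[j][i], a[i][j]
    elif r1 - r0 >= c1 - c0:                   # split the longer side in half
        rm = (r0 + r1) // 2
        _swap_block(a, r0, rm, c0, c1, cutoff)
        _swap_block(a, rm, r1, c0, c1, cutoff)
    else:
        cm = (c0 + c1) // 2
        _swap_block(a, r0, r1, c0, cm, cutoff)
        _swap_block(a, r0, r1, cm, c1, cutoff)


def transpose_recursive(a, lo=0, hi=None, cutoff=4):
    """In-place transposition by recursive divide-and-conquer."""
    if hi is None:
        hi = len(a)
    if hi - lo <= cutoff:
        for i in range(lo, hi):
            for j in range(lo, i):
                a[i][j], a[j][i] = a[j][i], a[i][j]
    else:
        mid = (lo + hi) // 2
        transpose_recursive(a, lo, mid, cutoff)        # upper-left diagonal block
        transpose_recursive(a, mid, hi, cutoff)        # lower-right diagonal block
        _swap_block(a, mid, hi, lo, mid, cutoff)       # off-diagonal block pair
    return a
```

In both versions the units of work (tiles, or the sub-calls of the recursion) touch disjoint pairs of elements, so they could be distributed across threads.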

Click here to learn more


Test-driving Intel® Xeon Phi™ Coprocessors with a Basic N-body Simulation

Posted: January 07, 2013

Intel® Xeon Phi™ coprocessors are capable of delivering more performance and better energy efficiency than Intel® Xeon® processors for certain parallel applications. In this paper, Andrey Vladimirov of Stanford University and Vadim Karpusenko of Colfax International investigate the porting and optimization of a test problem for the Intel Xeon Phi™ coprocessor. The test problem is a basic N-body simulation, which is the foundation of a number of applications in computational astrophysics and biophysics. Using common code in the C language for the host processor and for the coprocessor, they benchmark the N-body simulation. The simulation runs 2.3x to 5.4x faster on a single Intel Xeon Phi™ coprocessor than on two Intel Xeon E5 series processors. The performance depends on the accuracy settings for transcendental arithmetic. They also study the assembly code produced by the compiler from the C code. This makes it possible to pinpoint some strategies for designing C/C++ programs that result in efficient automatically vectorized applications for Intel Xeon family devices.
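The core of a basic N-body simulation is the all-pairs force calculation, sketched below for gravity with G = 1. The softening term, the data layout, and the name `accelerations` are assumptions made for this illustration; the paper's test code is written in C.

```python
import math


def accelerations(pos, mass, softening=1e-3):
    """Gravitational acceleration on each body by direct O(N^2) summation (G = 1)."""
    n = len(pos)
    eps2 = softening * softening               # avoids division by zero at tiny separations
    acc = []
    for i in range(n):
        xi, yi, zi = pos[i]
        ax = ay = az = 0.0
        for j in range(n):
            if j == i:
                continue
            dx = pos[j][0] - xi
            dy = pos[j][1] - yi
            dz = pos[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))   # the square root and division dominate
            ax += mass[j] * dx * inv_r3
            ay += mass[j] * dy * inv_r3
            az += mass[j] * dz * inv_r3
        acc.append([ax, ay, az])
    return acc
```

The inner loop consists almost entirely of a square root and a division, so in compiled code the accuracy settings chosen for these operations can noticeably affect the measured speed.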

Click here to learn more


Auto-Vectorization with the Intel Compilers: is Your Code Ready for Sandy Bridge and Knights Corner?

Posted: March 12, 2012

One of the features of Intel’s Sandy Bridge-E processor released this month is the support for the Advanced Vector Extensions (AVX) instruction set. Codes suitable for efficient auto-vectorization by the compiler will be able to take advantage of AVX without any code modification, with only re-compilation.

This paper explains the guidelines for code design suitable for auto-vectorization by the compiler (elimination of vector dependence, implementation of unit-stride data access and proper address alignment) and walks the reader through a practical example of code development with auto-vectorization. The resulting code is compiled and executed on two computer systems: a Westmere CPU-based system with SSE 4.2 support, and a Sandy Bridge-based system with AVX support. The benefit of vectorization is more significant in the AVX version, if the code is designed efficiently. An ‘elegant’, but inefficient solution is also provided and discussed.

In addition, the paper provides a comparative benchmark of the Sandy Bridge and Westmere systems, based on the discussed algorithm. Implications of auto-vectorization methods for Intel’s future Many Integrated Core technology based on the Knights Corner chip are discussed at the end.

Click here to learn more