timdoug's interesting tidbits

Little bits of technical documentation and such. Hopefully helpful.

2010-03-02

What CPUs do the Amazon EC2 High-Memory On-Demand Instances use?

I can't vouch for the Extra Large or Double Extra Large ones, but the Quadruple Extra Large ones use the Nehalem-based "Gainestown" Intel Xeon X5550 @ 2.67GHz:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping        : 5
cpu MHz         : 2666.760
cache size      : 8192 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu tsc msr pae mce cx8 apic mca cmov pat pse36
clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc
pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca popcnt lahf_lm
bogomips        : 5337.91
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
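
You can check what you've been handed on any Linux instance yourself -- the model name is right there in /proc/cpuinfo:

grep -m1 'model name' /proc/cpuinfo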

[/hpc] permanent link

2009-06-01

High-Performance Computing and the PlayStation 3

I purchased a PS3 last summer with the express intent of running some simulations and benchmarks (and playing Gran Turismo 5 Prologue), but never got around to it (the first part, that is).

A few interesting links and PDFs:

I'm intrigued by CUDA as well. We'll see where that goes.

[/hpc] permanent link

How to Install and Run HPL with GotoBLAS on Linux

GFLOPs are fun. Here's how to determine your cluster's performance.

  1. Make sure you have a proper build system: gcc, make, etc. On Debian: apt-get install build-essential.
  2. Install OpenMPI and gfortran. On Debian, it's as simple as apt-get install gfortran libopenmpi-dev openmpi-bin.
  3. Download and compile GotoBLAS. It's generally lauded as the fastest BLAS for (at least) x86 machines. Compilation is a simple ./quickbuild.64bit (or 32, as appropriate) in the root directory of the tarball. (The shell sketch after this list walks through the whole build.)
  4. Download and untar HPL. I used 1.0a -- 2.0 wouldn't compile for me.
  5. Create a Make.[arch] in the root dir of the hpl folder, and configure accordingly. An appropriate example diff against setup/Make.Linux_PII_CBLAS:
    --- setup/Make.Linux_PII_CBLAS	2004-01-22 00:13:11.000000000 -0500
    +++ Make.timdoug	2009-06-01 00:23:29.000000000 -0400
    @@ -61,7 +61,7 @@
     # - Platform identifier ------------------------------------------------
     # ----------------------------------------------------------------------
     #
    -ARCH         = Linux_PII_CBLAS
    +ARCH         = timdoug
     #
     # ----------------------------------------------------------------------
     # - HPL Directory Structure / HPL library ------------------------------
    @@ -81,9 +81,9 @@
     # header files,  MPlib  is defined  to be the name of  the library to be
     # used. The variable MPdir is only used for defining MPinc and MPlib.
     #
    -MPdir        = /usr/local/mpi
    -MPinc        = -I$(MPdir)/include
    -MPlib        = $(MPdir)/lib/libmpich.a
    +MPdir        = /usr
    +MPinc        = -I$(MPdir)/include/mpi
    +MPlib        = -lmpi
     #
     # ----------------------------------------------------------------------
     # - Linear Algebra library (BLAS or VSIPL) -----------------------------
    @@ -92,9 +92,9 @@
     # header files,  LAlib  is defined  to be the name of  the library to be
     # used. The variable LAdir is only used for defining LAinc and LAlib.
     #
    -LAdir        = $(HOME)/netlib/ARCHIVES/Linux_PII
    +LAdir        =
     LAinc        =
    -LAlib        = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
    +LAlib        = ~/GotoBLAS/libgoto.a
     #
     # ----------------------------------------------------------------------
     # - F77 / C interface --------------------------------------------------
    @@ -156,7 +156,7 @@
     #    *) call the BLAS Fortran 77 interface,
     #    *) not display detailed timing information.
     #
    -HPL_OPTS     = -DHPL_CALL_CBLAS
    +HPL_OPTS     = 
     #
     # ----------------------------------------------------------------------
     #
    @@ -173,7 +173,7 @@
     # On some platforms,  it is necessary  to use the Fortran linker to find
     # the Fortran internals used in the BLAS library.
     #
    -LINKER       = /usr/bin/g77
    +LINKER       = /usr/bin/gcc
     LINKFLAGS    = $(CCFLAGS)
     #
     ARCHIVER     = ar
  6. In hpl/bin/[arch], tweak HPL.dat. Example file:
    HPLinpack benchmark input file
    Innovative Computing Laboratory, University of Tennessee
    HPL.out      output file name (if any)
    6            device out (6=stdout,7=stderr,file)
    1            # of problems sizes (N)
    16384        Ns
    1            # of NBs
    128          NBs
    0            PMAP process mapping (0=Row-,1=Column-major)
    1            # of process grids (P x Q)
    1            Ps
    2            Qs
    16.0         threshold
    1            # of panel fact
    2            PFACTs (0=left, 1=Crout, 2=Right)
    1            # of recursive stopping criterium
    4            NBMINs (>= 1)
    1            # of panels in recursion
    2            NDIVs
    1            # of recursive panel fact.
    2            RFACTs (0=left, 1=Crout, 2=Right)
    1            # of broadcast
    1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
    1            # of lookahead depth
    0            DEPTHs (>=0)
    0            SWAP (0=bin-exch,1=long,2=mix)
    64           swapping threshold
    0            L1 in (0=transposed,1=no-transposed) form
    0            U  in (0=transposed,1=no-transposed) form
    1            Equilibration (0=no,1=yes)
    8            memory alignment in double (> 0)
    This is appropriate for a dual-core, 4GB RAM system. Important values to change:
    • N -- problem size, i.e. the dimension of the matrix. Start at ~1000 and ramp up until you near the limit of your RAM; memory use grows quadratically, at 8 bytes per matrix element (see the sizing example after this list).
    • NB -- block size. 128 works well for me. Others suggest 80, 160, or 256. Experiment.
    • P and Q -- their product should equal the number of cores in your cluster. Certain grids work better than others (roughly square, with P <= Q, is a good starting point); do test.
    HPL.dat tweaking is a kind of black magic -- check the tubes for further information.
  7. Run GOTO_NUM_THREADS=1 mpiexec -np [num processes] ./xhpl. GOTO_NUM_THREADS=1 keeps GotoBLAS's own threading from fighting with MPI for cores: one process per core, one thread per process.
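
For reference, here are steps 2 through 5 as shell commands -- a sketch rather than gospel, since the tarball names vary by release and the GotoBLAS/hpl paths are just how I happen to lay things out:

apt-get install build-essential gfortran libopenmpi-dev openmpi-bin
tar xzf GotoBLAS.tar.gz && cd GotoBLAS     # tarball name varies by release
./quickbuild.64bit                         # or .32bit, as appropriate
cd ..
tar xzf hpl.tgz && cd hpl                  # 1.0a; filename may differ
cp setup/Make.Linux_PII_CBLAS Make.timdoug
# ...edit Make.timdoug as per the diff above, then:
make arch=timdoug
cd bin/timdoug                             # HPL.dat and xhpl end up here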
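
And the promised sizing example for N: the matrix is N x N doubles, so it eats 8*N^2 bytes. Aiming at ~80% of a 4GB machine:

awk 'BEGIN { printf "N ~ %d\n", sqrt(0.80 * 4 * 2^30 / 8) }'
N ~ 20724

(The 16384 in the example file above works out to exactly 2GB of matrix, which leaves comfortable headroom.)
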
Using these instructions, I achieved 5.5 GFLOPs per core and 10 GFLOPs in total on a Pentium D 930 machine, 9.5 GFLOPs per core and 18 GFLOPs on a Core 2 Duo E6700 processor, and 50.6 GFLOPs on an Amazon EC2 "High-CPU Extra Large Instance" (dual quad-core Xeon E5345s):
domU:~/hpl/bin/timdoug# GOTO_NUM_THREADS=1 mpiexec -np 8 ./xhpl
============================================================================
HPLinpack 1.0a  --  High-Performance Linpack benchmark  --   January 20, 2004
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
============================================================================
[[[snip]]]
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2C4       20000   128     2     4             105.33          5.064e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0067492 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0065323 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0012449 ...... PASSED
Hooray!
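
A quick back-of-the-envelope on that last number: the E5345 can retire four double-precision flops per clock per core, so eight cores at 2.33GHz peak at 8 * 2.33 * 4 = ~74.6 GFLOPs, which puts the 50.6 GFLOPs run at roughly 68% efficiency. Not bad at all for virtualized hardware.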

[/hpc] permanent link


© 2006-18 timdoug | email: "me" at this domain