
Transcript of CUDA_May_09_TB_L5

Lecture 5: Multi-GPU computing with CUDA and MPI

    Tobias Brandvik

The story so far

Getting started (Pullan)
An introduction to CUDA for science (Pullan)
Developing kernels I (Gratton)
Developing kernels II (Gratton)
CUDA with multiple GPUs (Brandvik)
Medical imaging registration (Ansorge)

Agenda

MPI overview
The MPI programming model
Heat conduction example (CPU)
MPI and CUDA
Heat conduction example (GPU)
Performance measurements

MPI overview

MPI is a specification of a Message Passing Interface.
The specification is a set of functions with prescribed behaviour.
It is not a library: there are multiple competing implementations of the specification.
Two popular open-source implementations are Open-MPI and MPICH2.
Most MPI implementations from vendors are customised versions of these.

Why use MPI?

Performance
Scalability
Stability

What hardware does MPI run on?

Distributed memory clusters
  MPI's popularity is in large part due to the rise of cheap clusters with commodity x86 nodes over the last 15 years
  Ethernet or Infiniband interconnects
Shared memory
  Some MPI implementations are also suitable for multi-core shared memory machines (e.g. high-end desktops)

MPI programming model

An MPI program consists of several processes
Each process can execute different instructions
Each process has its own memory space
Processes can only communicate by sending messages to each other

MPI programming model

[Figure: two processes, Rank 0 and Rank 1, each with its own CPU and memory, inside a communicator]

Rank: a unique integer identifier for a process
Communicator: the collection of processes which may communicate with each other

A simple example in pseudo-code

We want to copy an array from one processor to another.

rank 0:
  float a[10]; float b[10];
  recv(b, 10, float, 1, 200)
  send(a, 10, float, 1, 300)
  wait()

rank 1:
  float a[10]; float b[10];
  recv(b, 10, float, 0, 300)
  send(a, 10, float, 0, 200)
  wait()

The arguments are, in order: the memory location, the message length, the datatype, the rank of the sending (or receiving) process, and the message tag.

The only 7 MPI functions you'll ever need

MPI-1 has more than 100 functions, but most applications only use a small subset of these.
In fact, you can write production code using only 7 MPI functions (though you'll probably use a few more).

The only 7 MPI functions you'll ever need

MPI_Init
MPI_Comm_size
MPI_Comm_rank
MPI_Isend
MPI_Irecv
MPI_Waitall
MPI_Finalize

The MPI specification is defined for C, C++ and Fortran; we'll consider the C function prototypes.

A closer look at the functions

int MPI_Init( int *argc, char ***argv )
  Initialises the MPI execution environment

int MPI_Comm_size( MPI_Comm comm, int *size )
  Determines the size of the group associated with a communicator

int MPI_Comm_rank( MPI_Comm comm, int *rank )
  Determines the rank of the calling process in the communicator

int MPI_Finalize()
  Terminates the MPI execution environment
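
As a minimal illustration of these four calls (not from the slides), the sketch below prints each process's rank and the communicator size; the file name and printed text are just assumptions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int size, rank;
    MPI_Init(&argc, &argv);                  /* start the MPI environment */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* number of processes in the communicator */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                          /* shut down MPI */
    return 0;
}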

A closer look at the functions

int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request )

buf: memory location for the message
count: number of elements in the message
datatype: type of the elements in the message (e.g. MPI_FLOAT)
source: rank of the source
tag: message tag
comm: communicator
request: communication request (used for checking message status)

A closer look at the functions

int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request )

buf: memory location for the message
count: number of elements in the message
datatype: type of the elements in the message (e.g. MPI_FLOAT)
dest: rank of the destination
tag: message tag
comm: communicator
request: communication request (used for checking message status)
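
The last of the seven functions, MPI_Waitall, completes a set of non-blocking requests; its prototype is not shown on the slides, but for completeness it is:

int MPI_Waitall( int count, MPI_Request array_of_requests[], MPI_Status array_of_statuses[] )

count: number of requests
array_of_requests: array of communication requests to complete
array_of_statuses: array of status objects, one per request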

The structure of an MPI program

Startup
  MPI_Init
  MPI_Comm_size / MPI_Comm_rank
  Read in and initialise data based on the process rank

Inner loop
  Post all receives: MPI_Irecv
  Post all sends: MPI_Isend
  Wait for message passing to finish: MPI_Waitall
  Perform computation

End
  Write out data
  MPI_Finalize

An actual MPI program

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Request req_in, req_out;
    MPI_Status stat_in, stat_out;
    float a[10], b[10];
    int mpi_rank, mpi_size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    if (mpi_rank == 0) {
        MPI_Irecv(b, 10, MPI_FLOAT, 1, 200, MPI_COMM_WORLD, &req_in);
        MPI_Isend(a, 10, MPI_FLOAT, 1, 300, MPI_COMM_WORLD, &req_out);
    }
    if (mpi_rank == 1) {
        MPI_Irecv(b, 10, MPI_FLOAT, 0, 300, MPI_COMM_WORLD, &req_in);
        MPI_Isend(a, 10, MPI_FLOAT, 0, 200, MPI_COMM_WORLD, &req_out);
    }

    MPI_Waitall(1, &req_in, &stat_in);
    MPI_Waitall(1, &req_out, &stat_out);

    MPI_Finalize();
    return 0;
}

Compiling and running MPI programs

MPI implementations provide wrappers for popular compilers, normally named mpicc/mpicxx/mpif77 etc.
An MPI program is normally run through mpirun -np N ./a.out
So, for the previous example:

mpicc mpi_example.c
mpirun -np 2 ./a.out

These commands are for Open-MPI; others may differ slightly.

Heat conduction example (CPU)

We'll modify the heat conduction example from earlier to work with multiple CPUs.

2D heat conduction

In 2D:

    \frac{\partial T}{\partial t} = \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2}

For which a possible finite difference approximation is:

    \frac{\Delta T}{\Delta t} = \frac{T_{i+1,j} - 2T_{i,j} + T_{i-1,j}}{\Delta x^2} + \frac{T_{i,j+1} - 2T_{i,j} + T_{i,j-1}}{\Delta y^2}

where \Delta T is the temperature change over a time \Delta t, and i, j are indices into a uniform structured grid (see next slide).

Stencil

[Figure: five-point stencil on the structured grid]

Update the red point using data from the blue points (and the red point itself).

Finding more parallelism

In the previous lectures, we have tried to find enough parallelism in the problems for 1000s of threads.
This is fine-grained parallelism.
For MPI, we need another level of parallelism on top of this.
This is coarse-grained parallelism.

Domain decomposition and halos

[Figures: the grid is decomposed into one sub-domain per rank, with a layer of fictitious boundary nodes added along each internal boundary]

The fictitious boundary nodes are called halos.

Message passing pattern

The left-most rank sends data to the right
The inner ranks send data to both the left and the right
The right-most rank sends data to the left

[Figure: Rank 0, Rank 1 and Rank 2 side by side, exchanging halo data with their neighbours]

Message buffers

MPI can read and write directly from 2D arrays using an advanced feature called datatypes (but this is complicated and doesn't work for GPUs).
Instead, we use 1D incoming and outgoing buffers. The message-passing strategy is then (a sketch of the pack/unpack helpers follows below):

Fill the outgoing buffers (2D -> 1D)
Send from the outgoing buffers, receive into the incoming buffers
Wait
Fill the arrays from the incoming buffers (1D -> 2D)
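
As an illustration of the pack/unpack step, here is a minimal sketch in C, assuming the temperature field is stored row-major in a flat array of size ni x nj and that each rank exchanges one boundary column with a neighbour; the names and layout are assumptions, not the lecture's code.

/* Pack boundary column icol of a ni x nj row-major array into a 1D buffer. */
void fill_out_buffer(const float *temp, float *out_buf, int ni, int nj, int icol)
{
    for (int j = 0; j < nj; j++)
        out_buf[j] = temp[j * ni + icol];     /* 2D -> 1D */
}

/* Unpack a received 1D buffer into the halo column icol. */
void empty_in_buffer(float *temp, const float *in_buf, int ni, int nj, int icol)
{
    for (int j = 0; j < nj; j++)
        temp[j * ni + icol] = in_buf[j];      /* 1D -> 2D */
}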

Heat conduction example (single CPU)

for (i = 0; i < nstep; i++) {
    step_kernel();
}
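
step_kernel() itself is the stencil update from the earlier lectures. A plain-C sketch of what it might look like for the finite-difference formula above, assuming row-major storage and hypothetical names (temp_in, temp_out, ni, nj, dt, dx, dy):

/* One explicit time step of the 2D heat equation on the interior points. */
void step_kernel(int ni, int nj, float dt, float dx, float dy,
                 const float *temp_in, float *temp_out)
{
    for (int j = 1; j < nj - 1; j++) {
        for (int i = 1; i < ni - 1; i++) {
            int idx = j * ni + i;
            float d2tdx2 = (temp_in[idx + 1]  - 2.0f * temp_in[idx] + temp_in[idx - 1])  / (dx * dx);
            float d2tdy2 = (temp_in[idx + ni] - 2.0f * temp_in[idx] + temp_in[idx - ni]) / (dy * dy);
            temp_out[idx] = temp_in[idx] + dt * (d2tdx2 + d2tdy2);
        }
    }
}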

Heat conduction example (multi-CPU)

for (i = 0; i < nstep; i++) {
    fill_out_buffers();
    if (mpi_rank == 0) {                              // left
        receive_right();
        send_right();
    }
    if (mpi_rank > 0 && mpi_rank < mpi_size-1) {      // inner
        receive_left();
        receive_right();
        send_left();
        send_right();
    }
    if (mpi_rank == mpi_size-1) {                     // right
        receive_left();
        send_left();
    }
    wait_all();
    empty_in_buffers();
    step_kernel();
}
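
The send/receive helpers above are thin wrappers around the MPI_Isend/MPI_Irecv calls from earlier. A possible sketch of one matched pair, moving halo data to the right (the buffer names, request array and the globals nj and mpi_rank are assumptions, not the lecture's code):

static MPI_Request req[4];     /* outstanding halo requests for this rank */
static int nreq = 0;

void send_right(void)          /* post a send of this rank's right boundary column */
{
    MPI_Isend(out_buf_right, nj, MPI_FLOAT, mpi_rank + 1, 0, MPI_COMM_WORLD, &req[nreq++]);
}

void receive_left(void)        /* post the matching receive into the left halo buffer */
{
    MPI_Irecv(in_buf_left, nj, MPI_FLOAT, mpi_rank - 1, 0, MPI_COMM_WORLD, &req[nreq++]);
}

void wait_all(void)            /* complete all outstanding halo messages */
{
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
    nreq = 0;
}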

Heat conduction example (multi-GPU)

How does all this work when we use GPUs?
Just like with CPUs, except we need buffers on both the CPU and the GPU.
Use one MPI process per GPU.

Message buffers with GPUs

Message-passing strategy with GPUs (a CUDA sketch of the packing step follows below):

Fill the outgoing buffers on the GPU using a kernel (2D -> 1D)
Copy the buffers to the CPU - cudaMemcpy(DeviceToHost)
Send from the outgoing buffers, receive into the incoming buffers
Wait
Copy the buffers to the GPU - cudaMemcpy(HostToDevice)
Fill the arrays from the incoming buffers on the GPU using a kernel (1D -> 2D)
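
A minimal CUDA sketch of the first two steps, assuming the same row-major layout and hypothetical names as the CPU sketch earlier (this would live in a .cu file):

/* GPU kernel: pack boundary column icol of a ni x nj row-major array into a 1D buffer. */
__global__ void fill_out_buffer_gpu(const float *temp, float *out_buf, int ni, int nj, int icol)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < nj)
        out_buf[j] = temp[j * ni + icol];
}

/* Host side: launch the packing kernel, then copy the buffer to the CPU for MPI. */
void pack_and_copy(const float *d_temp, float *d_out_buf, float *h_out_buf,
                   int ni, int nj, int icol)
{
    int threads = 256;
    int blocks = (nj + threads - 1) / threads;
    fill_out_buffer_gpu<<<blocks, threads>>>(d_temp, d_out_buf, ni, nj, icol);
    cudaMemcpy(h_out_buf, d_out_buf, nj * sizeof(float), cudaMemcpyDeviceToHost);
}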

Heat conduction example (multi-GPU)

for (i = 0; i < nstep; i++) {
    fill_out_buffers_cpu();
    recv();
    send();
    wait();
    empty_in_buffers_cpu();
    step_kernel_cpu();
}

Heat conduction example (multi-GPU)

for (i = 0; i < nstep; i++) {
    fill_out_buffers_gpu();          // (2D -> 1D)
    cudaMemcpy(DeviceToHost);
    recv();
    send();
    wait();
    cudaMemcpy(HostToDevice);
    empty_in_buffers_gpu();          // (1D -> 2D)
    step_kernel_gpu();
}
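
The cudaMemcpy(DeviceToHost) lines above are shorthand: the real call takes the destination and source pointers, the size in bytes, and the copy direction. For example (buffer names assumed):

/* Copy the packed outgoing halo buffer from the GPU to the host before the MPI send. */
cudaMemcpy(h_out_buf, d_out_buf, nj * sizeof(float), cudaMemcpyDeviceToHost);

/* ...recv/send/wait... then copy the received halo back to the GPU. */
cudaMemcpy(d_in_buf, h_in_buf, nj * sizeof(float), cudaMemcpyHostToDevice);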

Compiling code with CUDA and MPI

You can put everything in a .cu file and use nvcc as before, but you need to include the MPI headers and link against the MPI library:

nvcc mpi_example.cu -I $HOME/open-mpi/include -L $HOME/open-mpi/lib -lmpi

Or, compile the C code with mpicc and the CUDA code with nvcc, and link the results together into an executable.

For simple examples, the first approach is fine, but for complicated applications the second approach is cleaner.
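
A possible command sequence for the second approach (the file names, output name and CUDA library path are assumptions):

# Compile the MPI host code and the CUDA kernels separately, then link.
mpicc -c main.c -o main.o
nvcc -c kernels.cu -o kernels.o
mpicc main.o kernels.o -L /usr/local/cuda/lib64 -lcudart -o heat_mpi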

Scaling performance

When benchmarking MPI applications, we look at two issues:

Strong scaling - how well does the application scale with multiple processors for a fixed problem size?
Weak scaling - how well does the application scale with multiple processors for a fixed problem size per processor?
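
In quantitative terms (standard definitions, not given on the slide), with T(p) the runtime on p processes:

% Strong scaling: fixed total problem size
S(p) = \frac{T(1)}{T(p)}, \qquad E_{\text{strong}}(p) = \frac{T(1)}{p\,T(p)}

% Weak scaling: fixed problem size per process
E_{\text{weak}}(p) = \frac{T(1)}{T(p)}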

GPU scaling issues

Achieving good scaling is more difficult with GPUs for two reasons:
1. There is an extra memory copy (cudaMemcpy) involved for every message
2. The kernels are much faster, so the MPI communication becomes a larger fraction of the overall runtime

Typical scaling experience

[Figure: two plots of performance against number of processors, one for weak scaling and one for strong scaling, each comparing CPU and GPU curves against the ideal]

Summary

MPI is a good approach to parallelism on distributed memory machines
It uses an explicit message-passing model
Grid problems can be solved in parallel by using halo nodes
You don't need to change your kernels to use MPI, but you will need to add the message-passing logic
Using MPI and CUDA together can be done by using both host and device message buffers
Achieving good scaling is more difficult since the kernels are faster on the GPU