The fftMPI library has 4 classes: FFT3d, FFT2d, Remap3d, Remap2d. The FFT classes perform 3d and 2d FFTs. The Remap classes perform a data remap, which communicates and reorders the data that is distributed 3d and 2d arrays across processors. They can be used if you want to only rearrange data, but not perform an FFT.

The basic way to use either of the FFT classes is to instantiate the class and call setup() once to define how the FFT grid is distributed across processors for both input to and output from the FFT. Then invoke the compute() method as many times as needed to perform forward and/or backward FFTs. Destruct the class when you are done using it.

Example code to do all of this for 3d FFTs in C++ is shown below. This code is a copy of the test/simple.cpp file. There are also equivalent files in the test dir for C, Fortran, and Python: simple_c.c, simple_f90.f90, and simple.py.

If a different size FFT or different grid distribution is needed, the FFT classes can be intstantiated as many times as needed.

Using the Remap classes is similar, where the remap() method replaces compute().

The choice to operate on single versus double precision data must be made at compile time, as explained on the compile doc page.

These are the methods that can be called for either the FFT3d or FFT2d classes:

- constructor and destructor
- setup() = define grid size and input/output layouts
- setup_memory() = caller provide memory for FFT to use
- set() = set a parameter affecting how FFT is performed
- tune() = auto-tune paramters that affect how FFT is performed
- compute() = compute a single forward or backward FFT
- only_1d_ffts() = compute just 1d FFTs, no data movement
- only_remaps() = just data movement, no FFTs
- only_one_remap() = just one pass of data movement
- get() = query parameter or timing information

See these apps in the test dir for examples of how all these methods are called from different languages:

- test3d.cpp, test3d_c.cpp, test3d_f90.f90, test3d.py
- test2d.cpp, test2d_c.cpp, test2d_f90.f90, test2d.py

These are methods that can be called for either the Remap3d or Remap2d classes:

- constructor and destructor
- setup() = define grid size and input/output layouts
- set() = set a parameter affecting how Remap is performed
- remap() = perform the the Remap

These files in the text dir of the fftMPI distribution compute a forward/backward FFT on any number of procs. The size of the FFT is hardcoded at the top of the file. Each file is about 150 lines with comments:

- simple.cpp
- simple_c.c
- simple_f90.f90
- simple.py

You should be able to compile/run any of them as as follows:

cd test make simple mpirun -np 4 simple # run C++ example on 4 procs simple_c # run C example on 1 proc mpirun -np 10 simple_f90 # run Fortran example on 10 procs mpirun -np 6 python simple.py # run Python example on 6 procs

You must link to the fftMPI library built for double-precision FFTs. In the simple source codes, you could replace "3d" by "2d" for 2d FFTs.

The C++ code is in the next section.

// Compute a forward/backward double precision complex FFT using fftMPI // change FFT size by editing 3 "FFT size" lines // run on any number of procs

// Run syntax: // % simple # run in serial // % mpirun -np 4 simple # run in parallel

#include#include #include #include

#include "fft3d.h"

using namespace FFTMPI_NS;

// FFT size

#define NFAST 128 #define NMID 128 #define NSLOW 128

// precision-dependent settings

#ifdef FFT_SINGLE int precision = 1; #else int precision = 2; #endif

// main program

int main(int narg, char **args) { // setup MPI

MPI_Init(&narg,&args); MPI_Comm world = MPI_COMM_WORLD;

int me,nprocs; MPI_Comm_size(world,&nprocs); MPI_Comm_rank(world,&me);

// instantiate FFT

FFT3d *fft = new FFT3d(world,precision);

// simple algorithm to factor Nprocs into roughly cube roots

int npfast,npmid,npslow;

npfast = (int) pow(nprocs,1.0/3.0); while (npfast < nprocs) { if (nprocs % npfast == 0) break; npfast++; } int npmidslow = nprocs / npfast; npmid = (int) sqrt(npmidslow); while (npmid < npmidslow) { if (npmidslow % npmid == 0) break; npmid++; } npslow = nprocs / npfast / npmid;

// partition grid into Npfast x Npmid x Npslow bricks

int nfast,nmid,nslow; int ilo,ihi,jlo,jhi,klo,khi;

nfast = NFAST; nmid = NMID; nslow = NSLOW;

int ipfast = me % npfast; int ipmid = (me/npfast) % npmid; int ipslow = me / (npfast*npmid);

ilo = (int) 1.0*ipfast*nfast/npfast; ihi = (int) 1.0*(ipfast+1)*nfast/npfast - 1; jlo = (int) 1.0*ipmid*nmid/npmid; jhi = (int) 1.0*(ipmid+1)*nmid/npmid - 1; klo = (int) 1.0*ipslow*nslow/npslow; khi = (int) 1.0*(ipslow+1)*nslow/npslow - 1;

// setup FFT, could replace with tune()

int fftsize,sendsize,recvsize; fft->setup(nfast,nmid,nslow, ilo,ihi,jlo,jhi,klo,khi,ilo,ihi,jlo,jhi,klo,khi, 0,fftsize,sendsize,recvsize);

// tune FFT, could replace with setup()

//fft->tune(nfast,nmid,nslow, // ilo,ihi,jlo,jhi,klo,khi,ilo,ihi,jlo,jhi,klo,khi, // 0,fftsize,sendsize,recvsize,0,5,10.0,0);

// initialize each proc's local grid // global initialization is specific to proc count

FFT_SCALAR *work = (FFT_SCALAR *) malloc(2*fftsize*sizeof(FFT_SCALAR));

int n = 0; for (int k = klo; k <= khi; k++) { for (int j = jlo; j <= jhi; j++) { for (int i = ilo; i <= ihi; i++) { work[n] = (double) n; n++; work[n] = (double) n; n++; } } }

// perform 2 FFTs

double timestart = MPI_Wtime(); fft->compute(work,work,1); // forward FFT fft->compute(work,work,-1); // backward FFT double timestop = MPI_Wtime();

if (me == 0) { printf("Two %dx%dx%d FFTs on %d procs as %dx%dx%d grid\n", nfast,nmid,nslow,nprocs,npfast,npmid,npslow); printf("CPU time = %g secs\n",timestop-timestart); }

// find largest difference between initial/final values // should be near zero

n = 0; double mydiff = 0.0; for (int k = klo; k <= khi; k++) { for (int j = jlo; j <= jhi; j++) { for (int i = ilo; i <= ihi; i++) { if (fabs(work[n]-n) > mydiff) mydiff = fabs(work[n]-n); n++; if (fabs(work[n]-n) > mydiff) mydiff = fabs(work[n]-n); n++; } } }

double alldiff; MPI_Allreduce(&mydiff,&alldiff,1,MPI_DOUBLE,MPI_MAX,world); if (me == 0) printf("Max difference in initial/final values = %g\n",alldiff);

// clean up

free(work); delete fft; MPI_Finalize(); }