Implementation on Massively Parallel Architectures

I have implemented my implicit finite element codes on the Thinking Machines CM-5 under the data-parallel programming model, and on the Cray T3D under the message-passing programming model.

Even though the two parallel supercomputers use different programming models, my implementation of the finite element method is similar on both. There are three levels of parallelization: nodes, elements and equations. Each node, element and equation in the finite element mesh and equation system is assigned its own (virtual) processor under the data-parallel model, and groups of nodes, elements and equations are explicitly assigned to individual processors under the message-passing model. With these three parallel levels, data that concerns only a particular parallel unit (a node, element or equation) resides on a single processor, and computations that involve only a particular parallel unit (such as the element-level formation of the equation system) require no interprocessor communication. Elements are assigned to processors using mesh partitioning techniques, which place contiguous pieces of the finite element mesh onto individual processors; this assignment minimizes the communication between the element and equation levels.
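To make the three-level layout concrete, the following is a minimal sketch, under the message-passing model, of how a partition might assign a block of elements to each processor and derive the local set of nodes (and hence equations) from the element connectivity. The trivial one-dimensional mesh, the contiguous block partition, the one-degree-of-freedom-per-node assumption and all names here are illustrative stand-ins, not the actual code; a production code would use a proper mesh-partitioning algorithm.

```c
#include <stdio.h>

#define NUM_ELEMS      12   /* global elements in the hypothetical mesh  */
#define NODES_PER_ELEM  2   /* two-node line elements                    */
#define NUM_PROCS       4   /* processors (e.g. T3D processing elements) */

/* Global connectivity of the hypothetical mesh: element e uses nodes e, e+1. */
static int global_node(int e, int a) { return e + a; }

int main(void)
{
    for (int p = 0; p < NUM_PROCS; ++p) {
        /* Element level: a contiguous block of elements per processor,
           standing in for a real mesh-partitioning algorithm.           */
        int e_begin = p * NUM_ELEMS / NUM_PROCS;
        int e_end   = (p + 1) * NUM_ELEMS / NUM_PROCS;

        /* Node/equation level: the global nodes touched by the local
           elements; their equations are stored on the same processor
           (one degree of freedom per node is assumed here).            */
        int local_nodes[NUM_ELEMS + 1];
        int n_local = 0;
        for (int e = e_begin; e < e_end; ++e) {
            for (int a = 0; a < NODES_PER_ELEM; ++a) {
                int g = global_node(e, a);
                int seen = 0;
                for (int i = 0; i < n_local; ++i)
                    if (local_nodes[i] == g) { seen = 1; break; }
                if (!seen) local_nodes[n_local++] = g;
            }
        }

        printf("processor %d: elements %d..%d, %d local nodes/equations\n",
               p, e_begin, e_end - 1, n_local);
    }
    return 0;
}
```

Because most elements in a block touch only nodes owned by the same processor, only the nodes on partition boundaries give rise to interprocessor communication.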

When communication is required between the element and equation levels (as it is many times in each GMRES iteration), I use a two-step gather/scatter operation in which the data exchange is mediated by an intermediate, processor-level equation vector. On the CM-5 these communication routines are available in the CMSSL library; on the T3D they were written at the AHPCRC using PVM send and receive functions.
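The sketch below illustrates, for a single processor, the local part of that two-step pattern: element-level contributions are scatter-added into a processor-level equation vector, and equation values are later gathered back to the element level, for instance before the next element-by-element matrix-vector product in GMRES. The routine names, the local element-to-equation map and the one-degree-of-freedom-per-node layout are assumptions for illustration; the interprocessor exchange of shared boundary equations (done with CMSSL routines on the CM-5 and PVM send/receive functions on the T3D) is indicated only in a comment.

```c
#include <stdio.h>

#define NUM_LOCAL_ELEMS 4
#define NODES_PER_ELEM  2
#define NUM_LOCAL_EQNS  5

/* Hypothetical local element-to-equation map (one DOF per node). */
static const int lm[NUM_LOCAL_ELEMS][NODES_PER_ELEM] = {
    {0, 1}, {1, 2}, {2, 3}, {3, 4}
};

/* Scatter-add: accumulate element-level values into the processor-level
   equation vector.  Equations shared across partition boundaries would
   then be summed between processors (CMSSL on the CM-5, PVM on the T3D). */
static void scatter_add(double elem[][NODES_PER_ELEM], double eqn[])
{
    for (int i = 0; i < NUM_LOCAL_EQNS; ++i) eqn[i] = 0.0;
    for (int e = 0; e < NUM_LOCAL_ELEMS; ++e)
        for (int a = 0; a < NODES_PER_ELEM; ++a)
            eqn[lm[e][a]] += elem[e][a];
}

/* Gather: copy processor-level equation values back to the element level. */
static void gather(const double eqn[], double elem[][NODES_PER_ELEM])
{
    for (int e = 0; e < NUM_LOCAL_ELEMS; ++e)
        for (int a = 0; a < NODES_PER_ELEM; ++a)
            elem[e][a] = eqn[lm[e][a]];
}

int main(void)
{
    double elem[NUM_LOCAL_ELEMS][NODES_PER_ELEM] = {
        {1.0, 1.0}, {1.0, 1.0}, {1.0, 1.0}, {1.0, 1.0}
    };
    double eqn[NUM_LOCAL_EQNS];

    scatter_add(elem, eqn);   /* element level -> equation level */
    gather(eqn, elem);        /* equation level -> element level */

    for (int i = 0; i < NUM_LOCAL_EQNS; ++i)
        printf("eqn[%d] = %g\n", i, eqn[i]);
    return 0;
}
```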

Many of the finite element codes in my research group at the AHPCRC run at between 10 and 12 Gigaflops on the 512-node CM-5, and we see similar per-processor speeds on the T3D after substantial scalar code optimization.

I have also implemented my finite element codes on the shared-memory architecture of the parallel SGI supercomputers. Even though these computers do not generally match the more expensive supercomputers in memory capacity and floating-point rate (though this is changing rapidly), their price/performance ratio is quite attractive.