First Results from the Parallelisation of CERN's NA48 Simulation Program

J. Apostolakis, C. E. Bruschini, P. Calafiura, F. Gagliardi, M. Metcalf, A. Norton, B. Panzer-Steindel
CERN, Geneva, Switzerland

L. M. Bertolotto, K. J. Peach
Department of Physics and Astronomy, University of Edinburgh, UK

Abstract. The GP-MIMD2 project aims to demonstrate the effectiveness of using a European massively parallel supercomputer (MEIKO CS-2) in two different scientific production environments. One of its initial goals is the development of a parallel version of NMC, a specialized simulation code (developed by the CERN experiment NA48) which relies on a huge database (shower library) for maximum efficiency. NMC's memory requirements, combined with NA48's need for high statistics on a short timescale, make it a particularly interesting candidate for MPP applications. In our staged approach to NMC's parallelisation we decided to start with event-level parallelisation (task farming); this was implemented using tasks that communicate via the CHIMP message passing interface. An overview of our initial experience will be given, both with and without access to the shower library. First timing tests and an I/O analysis will be described, concluding with plans for future developments.

1. Introduction: The GP-MIMD2 project

The ESPRIT project P7255 GP-MIMD2¹ aims at acquiring experience with European parallel machines and at demonstrating the effectiveness of the massively parallel processing approach in the scientific environment [1]. Partners in this initiative include, on the users' and on the suppliers' sides:

- CERN, the European Laboratory for Particle Physics based in Geneva, equipped with some of the largest scientific machines in the world and constantly working at the frontier of basic physics.
- CERFACS (Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique), located in Toulouse (F), with wide experience in the development of algorithms and tools for parallel computations.
- major meteorological and climate centers in France, Germany and the United Kingdom, with a tradition of using vector supercomputers.
- Meiko (GB), Parsys (GB) and Telmat (F), major European producers of parallel systems with many years' experience.

The platforms of choice are two Meiko CS-2's (Computing Surface 2), general purpose multi-node UNIX-based MIMD systems currently scalable up to 1024 nodes. They employ commodity CPUs and memory components, such as Sun's SuperSPARC ("Viking") chips, with the possible addition of Fujitsu's μVP vector processors. All nodes are interconnected by an original, scalable high-speed packet switched network using a specialized communications processor at each node and a multi-stage crossbar switch (and can have Input/Output subsystems attached to them) [2].

GP-MIMD2's major objective is to demonstrate that European parallel computing technology is mature enough to be used for high performance computing production. Basic physics applications at CERN and meteorological modelling at CERFACS have been chosen as ideal demonstrators because of their highly demanding computing requirements and severe industrial production quality standards.

2. Computing in High Energy Physics

High energy physics experiments are quite often run by large international collaborations over periods of many years, using state-of-the-art equipment to:

- simulate the detector behaviour in order to optimize its design (and cost), and later on to reduce the uncertainty of Monte Carlo calculations.
- collect, during the running of the experimental detectors, huge volumes of raw data, containing for each event² the response of the single subdetectors (reducing the background in real time as much as possible).
- reconstruct in detail the kinematical and physical features of single events from the raw data, or only partially if the decision is going to be taken online.
- analyze the reconstructed events, producing distributions of functions for subsets of events.

The data itself is usually structured as a succession of events, each independent of the others. It therefore comes as no surprise that parallel processing at the event level ("farming") has become a reality during the last five years in the HEP world, implemented in a cost-effective way on clusters of RISC workstations. The future generation of experiments will nevertheless exceed today's computing potential by several orders of magnitude, in terms of CPU as well as I/O capability. This makes the investigation of high performance computing of paramount importance. It will therefore probably be necessary to exploit parallelism within the event, considering for example the modularity of the detectors or treating particle tracks separately (since they too are independent) [3].

3. The CERN Experiment NA48

Work package 3 (WP3) of GP-MIMD2 identified a physics "demonstrator" application, i.e. a promising candidate for running on MPP platforms such as the CS-2, within the experiment NA48, which is foreseen to take data at CERN in early 1996 [4]. It was also decided to start with the migration of the CERN Program Library, which contains a set of general purpose programs written and maintained by CERN for the use of the HEP community, since most HEP codes rely on it.

NA48 will use special high intensity beams of short-lived neutral kaons (K⁰), comparing their relative decay rates into two neutral (π⁰)³ and two charged pions (π⁺, π⁻). This will allow the measurement of a very important physical parameter⁴ with a precision one order of magnitude better than the current value and with minimal systematic error. The results will then be compared with theoretical predictions of the Standard Model of electroweak interactions.
The high statistics, high precision character of the experiment implies that it will be necessary to cope with very high data rates due to the huge background. Rates of potentially up to 100 Mbytes/s, with several thousand events per second (after the first three trigger levels), are needed in order to collect the necessary number of "physics" events (some 10⁷ per decay mode). Severe constraints are therefore being put on the data acquisition as well as on the detectors themselves. For example, the homogeneous liquid Krypton (LKr) calorimeter designed to detect neutral decays will be composed of as many as 13000 cells (2 x 2 x 135 cm), possessing excellent energy, space and time resolution.

Fig. 1. Simulation of shower developments in the LKr calorimeter for a neutral (4 γ) and a charged event (π⁺π⁻) respectively (the direction of the particles is normal to the page). Each box corresponds to a cell of the calorimeter and its size to the energy deposited therein.

In addition, it will be necessary to repeatedly simulate some 10⁷ K⁰ decays in the active volume of the detector, in order to estimate accurately the efficiency and acceptance (high statistics Monte Carlo).

4. The NA48 Simulation Program NMC and its Shower Library

NMC is a specialized and fast detector simulation code developed by the NA48 collaboration [5]. It does not directly perform a full event-by-event simulation of shower depositions in the LKr calorimeter (see above) because this would be excessively time consuming⁵, relying instead on a separately generated, distributed shower library. Each entry is identified by the particle type, the energy bin to which it belongs and the coordinates of its impact point on the calorimeter surface, and contains the energy deposited by the particle in the cells of a predefined area centered on its impact point. The size of a single shower depends for the time being only on the particle type, amounting to about 1 Kbyte for e⁺, e⁻ and γ and 5 Kbytes for π⁺, π⁻.
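The entry layout described above can be illustrated with a minimal Python sketch. It is not NMC's actual data format: the class and function names, the use of GeV and cm, and the bin widths are all assumptions made for illustration; only the key fields (particle type, energy bin, impact point) and the approximate entry sizes come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List

# Approximate per-shower sizes quoted in the text, in bytes.
SHOWER_SIZE_BYTES = {
    "e+": 1024, "e-": 1024, "gamma": 1024,   # about 1 Kbyte each
    "pi+": 5 * 1024, "pi-": 5 * 1024,        # about 5 Kbytes each
}


@dataclass(frozen=True)
class ShowerKey:
    """Identifies one library entry (hypothetical layout)."""
    particle: str     # particle type
    energy_bin: int   # index of the energy bin
    x_bin: int        # binned impact point on the calorimeter surface
    y_bin: int


def make_key(particle: str, energy_gev: float, x_cm: float, y_cm: float,
             energy_bin_gev: float = 5.0, impact_bin_cm: float = 0.5) -> ShowerKey:
    """Bin the continuous quantities; units and bin widths are assumptions."""
    return ShowerKey(
        particle,
        int(energy_gev // energy_bin_gev),
        int(x_cm // impact_bin_cm),
        int(y_cm // impact_bin_cm),
    )


class ShowerLibrary:
    """In-memory stand-in for the library: key -> energy deposited per cell."""

    def __init__(self) -> None:
        self._entries: Dict[ShowerKey, List[float]] = {}

    def add(self, key: ShowerKey, cell_energies: List[float]) -> None:
        self._entries[key] = list(cell_energies)

    def lookup(self, key: ShowerKey) -> List[float]:
        # Raises KeyError if no shower was generated for this bin.
        return self._entries[key]


def bytes_read_per_event(particles: List[str]) -> int:
    """Rough volume of library data needed to simulate one event."""
    return sum(SHOWER_SIZE_BYTES[p] for p in particles)
```

With these sizes, a neutral event (four photons) would need about 4 Kbytes of library data and a charged π⁺π⁻ event about 10 Kbytes, spread over as many random accesses as there are showers.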
Access time to this huge database, which will have an ultimate size of several Gbytes and contain some 10⁵ entries, obviously has to be kept as small as possible so that NMC will retain its efficiency. Note in particular that the neutral events in which we are interested require access to four photons per event.

Fig. 2. Schematic NMC Flowchart

The GP-MIMD2 project became interested in NMC not only because of the tight overall constraints on I/O and memory (database size, random access), but also because of the statistics needs described above, which could potentially be met "overnight" on MPP platforms (fast feedback). Possible future real-time applications such as online event reconstruction and calibration of the LKr calorimeter are also particularly appealing.

5. Task-Farming Parallelisation Without Shower Library

The first step in our staged approach was NMC event-level parallelisation in its simplest form, i.e. task farming (coarse grain parallelism). This guarantees maximum code flexibility since minimal changes between the sequential and parallel versions are needed, which is particularly welcome because NMC is still in an evolution phase. Such an implementation consists of several tasks that communicate with each other via message passing interfaces, i.e.:

- (at least) one master program, also called Source, which first initializes the working environment of the message passing interface as well as of the main code itself (setting up databases, etc.). During execution it synchronizes the slave processes (see below) and distributes the respective work loads, handing out "packets" of events to be processed. Note that the treatment of random numbers is quite important since they have to be generated and distributed correctly (i.e. in such a way that each worker will really generate different events), irrespective of the number of processors in the system.
- several identical and independent slave programs, or Event Workers, which do the actual event generation and are in fact copies of NMC with very little modification. A starting configuration could, for example, be characterized by having one Event Worker per processor.
- (at least) one program, also called Sink, that collects the results of the computations from the slaves (histograms for example) and deallocates resources. This part is in fact still under development.

Fig. 3. Task-farming parallelisation scheme (see also the next paragraph)

The tasks were interfaced using CHIMP⁶, developed at the Edinburgh Parallel Computer Centre (EPCC) and built on connectionless datagram services. Scaling tests have been carried out on up to 12 processors on clusters of SGI 340 (four nodes each) or Sun IPX machines, as well as on single multiprocessor machines (SGI Challenge with eight nodes) [7]. The total number of events to be processed was kept fixed, while the number of processors and of events per request (and consequently the number of messages) was varied in order to estimate the message passing overhead and verify the scaling behaviour. The lowest time required for a single "ping-pong" communication between two processes was found to be 8 msec on the SGI Challenge⁷, comparable to the time needed to process a single event on the same machine (remember that these tests were carried out without the shower library). Deviations from scaling appeared only with a high message passing overhead (10%), thus showing that task farming is feasible.

6. Task-Farming Parallelisation with Shower Library

Thus encouraged, we proceeded towards the second step in the parallelisation of NMC, i.e. task farming with random access to a large, distributed shower library.

6.1 Implementation and First Tests

An additional category of slave programs, the Shower Workers, was added to the ones described above, one of them being present on each node of the system containing a share of the library.
The latter will be distributed over as many nodes as possible in order to fully exploit the advantages of MIMD architectures, and uniformly (with regard to its access pattern) in order to guarantee I/O load balancing. During the processing of an event a worker will arrive at a point where a shower is needed; it will examine an internal database to determine where the desired information is, send a request to the corresponding task and wait. Having received the shower information it will continue until another shower is needed, and so forth.

The final version of the shower library is currently being generated; we therefore decided to work with several copies of a smaller shower library (16 Mbytes), running on the Central Simulation Facility cluster of HP workstations at CERN. This multi-user facility permits neither extensive testing nor timing analyses, but it was nevertheless possible to write the missing code and test it successfully.

6.2 I/O Analysis

In parallel with the activities previously described, we also optimized the sequential version of NMC, identifying and modifying time-critical routines, and carried out a preliminary I/O analysis on a single processor. For this purpose we created a huge file (1.2 Gbytes), whose access pattern mimics that of the shower library itself, and measured the mean time spent in database access for neutral events (four showers each, randomly positioned in the file). The result, about 50 msec in total, is slightly larger than the mean time necessary to process a neutral event on a Sun SuperSPARC processor running at 50 MHz.

Ways to hide this I/O wait time include minimizing disk access by distributing the database over as many nodes as possible (trying to exploit the main memory fully), or running several processes, in our case copies of NMC, on the same processor. The latter solution, in particular, has already been tested and proven to be very effective.

7. Plans for the Future

The NA48 simulation program NMC has been parallelized at the event level and first tests have been carried out. In the near future we will generate a realistic shower library and perform more detailed analyses (timing, scaling, etc.) on dedicated MIMD platforms (a Sun SPARCenter and the CS-2) using CHIMP. First production runs will follow, in order to test stability and results on a large scale and to fully assess the potential of MPP systems. In the long run we wish to investigate opportunities for finer grain parallelism and real-time applications such as the calibration of NA48's LKr calorimeter.

References

[1] Project P7255 GP-MIMD2, described in "High Performance Computing and Networking, Summaries of Projects (Synopsis)", Jan. 1993, CEC DG-XIII
[2] "Computing Surface Documentation Guide", 1993 Meiko World Incorporated
[3] K. J. Peach et al.: "The Ongoing Investigation of High Performance Parallel Computing in HEP", CERN/DRDC 93-10, DRDC P49, January 1993
[4] G. D. Barr et al.: "Proposal for a Precision Measurement of ε'/ε in CP Violating K⁰ → 2π Decays", CERN/SPSC/90-22, SPSC/P253, July 1990
[5] F. Leber et al.: "NMC User's Guide"
[6] R. Brun et al.: "GEANT User's Guide", CERN Program Library W5103
[7] B. Panzer-Steindel: "Test of Scalability for Task-Farming with CHIMP", CN-GPM Internal Note, August 1993

Footnotes

1 General-Purpose Multiple Instruction Multiple Data 2.
2 Which represents the "atomic" data unit in HEP and corresponds to the collision of a projectile particle with a target, not necessarily fixed, or with another projectile.
3 Each of them decaying in turn very quickly into two photons.
4 The direct CP violation parameter ε'/ε. See also [4] and the references contained therein.
5 For example GEANT [6] takes about 10 sec per average energy photon on an HP 9000/735.
6 Common High-level Interface to Message Passing.
7 This value is entirely due to protocol overhead and would obviously be much lower on a CHIMP version for multiprocessor machines (instead of workstation clusters).