The PIAF system is built on PAW (Physics Analysis Workstation), a data analysis and visualization package also developed at CERN [2]. Whereas PAW is intended to run stand-alone on a variety of platforms ranging from microcomputers to mainframes, PIAF is designed to interface with PAW and give the user transparent access to high-performance computing facilities that are physically located elsewhere. This lets the user carry out the most time-consuming and data-intensive computations in parallel on a PIAF server, which typically consists of a cluster of fast workstations with large disks, interconnected by a fast network.
PIAF manages the parallel computation with a client-server model in which the server side consists of one master and multiple slave servers. When a user first connects to PIAF from a local PAW session, the inetd super-server starts a new PIAF master process. The master in turn starts slave processes on all PIAF machines and then waits for further commands from the user. All communication between PIAF processes is carried over TCP/IP using Berkeley sockets.
The user can access and manipulate files on the remote PIAF system and view histograms and graphical plots as if all the work were carried out on the local workstation. When, for example, the PIAF master receives a command to construct a histogram from a data set, it divides the task among all the slaves running on the different machines. Each slave then processes its own portion of the data, which is mostly independent of the data handled by the other slaves. After finishing its computation, each slave reports back to the master, which in turn combines the partial results and sends them back to the client.
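The divide-and-merge pattern described above can be illustrated with a minimal Python sketch. It is not PIAF code: the function names (`partition`, `fill_histogram`, `merge`, `master_histogram`) are hypothetical, the slaves are simulated by threads in one process rather than by processes on separate machines, and splitting the events into equal contiguous ranges is an assumption about how the work might be divided.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(n_events, n_slaves):
    """Split the event range [0, n_events) into n_slaves contiguous chunks."""
    base, extra = divmod(n_events, n_slaves)
    bounds, start = [], 0
    for i in range(n_slaves):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def fill_histogram(values, n_bins, lo, hi):
    """Slave-side work: histogram one portion of the data."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v < hi:
            # Guard against rounding pushing a value just below hi past the last bin.
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts


def merge(partials):
    """Master-side work: add the partial histograms bin by bin."""
    return [sum(bin_counts) for bin_counts in zip(*partials)]


def master_histogram(column, n_slaves, n_bins, lo, hi):
    """Divide one column among the slaves, then merge their results."""
    chunks = [column[a:b] for a, b in partition(len(column), n_slaves)]
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        partials = list(pool.map(lambda c: fill_histogram(c, n_bins, lo, hi), chunks))
    return merge(partials)


if __name__ == "__main__":
    data = [float(i % 10) for i in range(1000)]
    print(master_histogram(data, n_slaves=4, n_bins=10, lo=0.0, hi=10.0))
```

Because bin counts are additive, the merged result is identical to what a single process would obtain over the full data set, which is why the slaves can work on their portions independently.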
PAW and PIAF store data in constructs called n-tuples. An n-tuple is essentially a data matrix with rows representing events and columns holding individual variables. Various statistical operations can easily be carried out on n-tuples using any combination of events or variables. N-tuples are stored column-wise, which allows access to an individual column without reading the complete data set. Column-wise n-tuples are very efficient for analyzing only certain variables over the entire data set, which is the task most frequently performed.
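A small Python sketch may help make the column-wise layout concrete. It assumes an in-memory container in which each variable is kept in its own array; the class `ColumnNtuple`, its methods, and the `reads` counter are hypothetical names introduced for illustration and do not reflect the actual PAW file format.

```python
from array import array


class ColumnNtuple:
    """Toy column-wise n-tuple: one contiguous array per variable."""

    def __init__(self, names):
        self.columns = {name: array("d") for name in names}
        # Count accesses per column to show which data an analysis touches.
        self.reads = {name: 0 for name in names}

    def fill(self, **event):
        """Append one event (row); every variable must be supplied."""
        if set(event) != set(self.columns):
            raise ValueError("event must provide exactly the n-tuple variables")
        for name, value in event.items():
            self.columns[name].append(value)

    def column(self, name):
        """Return a single variable for all events without touching the others."""
        self.reads[name] += 1
        return self.columns[name]

    def __len__(self):
        return len(next(iter(self.columns.values())))


if __name__ == "__main__":
    nt = ColumnNtuple(["px", "py", "energy"])
    for i in range(5):
        nt.fill(px=float(i), py=float(-i), energy=float(10 * i))
    energy = nt.column("energy")
    print(sum(energy) / len(energy), nt.reads)
```

Computing a statistic on `energy` reads only that array; with a row-wise layout, the same operation would have to step through every variable of every event.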
Histogramming operations usually require multiple passes over the same data. Combined with column-wise storage, this makes it possible to speed up subsequent accesses by caching the n-tuple data in memory. PAW and PIAF have an n-tuple cache of adjustable size, which is checked on every n-tuple access in case the required data already resides there. The default maximum cache size in PIAF is 54 MB per slave server.
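The caching behaviour can be sketched as a size-bounded column cache. The text does not say how PAW and PIAF choose what to discard when the cache is full, so the least-recently-used policy below is an assumption, as are the class name `ColumnCache`, the `loader` callback, and the estimate of 8 bytes per stored value.

```python
from collections import OrderedDict

BYTES_PER_VALUE = 8  # assumption: every value is stored as an 8-byte float


class ColumnCache:
    """Size-bounded cache of n-tuple columns with LRU eviction (assumed policy)."""

    def __init__(self, max_bytes, loader):
        self.max_bytes = max_bytes
        self._loader = loader          # called on a miss to read the column
        self._data = OrderedDict()     # key -> column values, oldest first
        self._used = 0
        self.hits = 0
        self.misses = 0

    def get(self, key):
        """Return the column for key, reading it only if it is not cached."""
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        values = self._loader(key)
        size = len(values) * BYTES_PER_VALUE
        if size <= self.max_bytes:       # columns larger than the cache are not kept
            while self._used + size > self.max_bytes:
                _, evicted = self._data.popitem(last=False)
                self._used -= len(evicted) * BYTES_PER_VALUE
            self._data[key] = values
            self._used += size
        return values


if __name__ == "__main__":
    cache = ColumnCache(max_bytes=54 * 1024 * 1024, loader=lambda key: [0.0] * 1000)
    cache.get("energy")   # first pass: read from the loader
    cache.get("energy")   # second pass: served from memory
    print(cache.hits, cache.misses)
```

The benefit shows up on the second and later passes over the same columns, which matches the iterative access pattern of histogramming.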
User data sets analyzed with PIAF range in size from a few megabytes to several gigabytes. A large n-tuple file is typically 200 megabytes (due to the size of the standard tape in use at CERN), but many such files can be chained to form a ``super-n-tuple'', which PAW and PIAF then see as one logical data set. Analysis of these data sets is usually carried out iteratively, requiring multiple passes over the data.
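One way to picture chaining is as a mapping from a global event number to a particular file and a local event number within it. The sketch below is illustrative only: the class `Chain`, its `locate` method, and the file names in the usage example are hypothetical and do not describe how PAW implements chains.

```python
from bisect import bisect_right
from itertools import accumulate


class Chain:
    """Present several n-tuple files as one logical data set."""

    def __init__(self, members):
        # members: list of (file_name, number_of_events) pairs, in chain order
        self.names = [name for name, _ in members]
        # offsets[i] is the global index of the first event in file i
        self.offsets = [0] + list(accumulate(count for _, count in members))

    def __len__(self):
        return self.offsets[-1]

    def locate(self, event):
        """Map a global event index to (file_name, local_index)."""
        if not 0 <= event < len(self):
            raise IndexError(event)
        i = bisect_right(self.offsets, event) - 1
        return self.names[i], event - self.offsets[i]


if __name__ == "__main__":
    chain = Chain([("run1.hbook", 3), ("run2.hbook", 2)])
    print(len(chain), chain.locate(3))
```

An analysis loop can then iterate over `range(len(chain))` without needing to know where one file ends and the next begins.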