"D. B. Miron" wrote:
>
> I downloaded the .pdf file but my reader had problems
> translating it so I got 7 blank pages. How about a plain
> text version?
Here's the plain text version. Those who want to see the graphs will
still have to read the pdf version.
--
Dave Michelson
dmichelson_at_home.com


NEC2 Benchmarks Under Linux:
Standard NEC2 Versus LAPACK and ASCI Red BLAS Versions

I D Flintoft
6/4/99

This note compares the performance of NEC2 running under Linux using
the built-in LU decomposition and back substitution routines with that
of a modified version using routines from LAPACK. Two versions of
LAPACK were investigated, one based on the reference BLAS
implementation and one built on the ASCI Red Optimised BLAS library.

System

Hardware

Intel Pentium II at 400MHz on a Supermicro P6DBS motherboard with
128MB PC100 SDRAM (512k pipeline burst L2 cache at half processor
speed).

OS

The Debian GNU/Linux 2.1 (http://www.debian.org) distribution was used
with a 2.2.5 Linux kernel (http://www.linuxhq.com). The benchmarks
were run in single user mode to prevent interference from system
daemons.

Compilers

Two compilers were compared in the benchmarks. The experimental GNU
compiler, egcs, version 1.1.2 (http://egcs.cygnus.com), was used with
the compiler options:

    -O2 -ffast-math -funroll-loops -fomit-frame-pointer
    -fno-emulate-complex -march=pentiumpro -malign-double

The -fno-emulate-complex option forces the compiler to use the complex
number support in the gcc back end, which is significantly faster than
the emulated complex number support in g77. Note that there are
potential problems with the gcc complex support, which is why it is
not used by default. Care must therefore be taken to check that valid
code is produced when using this option.

The time-limited demo of the Portland Group Inc. (PGI,
http://www.pgroup.com) Fortran compiler, version 1.7, was also used
with the options:

    -O2 -tp p6 -Munroll -Mdalign

NEC2 Source

The NEC2 source code, nec2_src.tar.Z, from the "Unofficial Numerical
Electromagnetics Code (NEC) Archives"
(http://www.qsl.net/wb6tpu/swindex.html) was used. The following
modifications were made to the code:

1. The code was made fully double precision, including all embedded
   constants.
2. Common blocks were realigned to comply with the FORTRAN77 standard.
3. Minor modifications were made to allow compilation with the GNU
   Fortran compiler.
4. The ETIME intrinsic was added to subroutine SECOND to provide
   timing information.
5. Other cleanups.

Standard NEC2

Standard NEC2 binaries were compiled directly from this modified NEC2
source code with the compiler options detailed above.

LAPACK NEC2

The LU decomposition and back substitution routines (FACTRS and
SOLVES) in NEC2 were modified to use the routines ZGETRF and ZGETRS
from the LAPACK library (http://www.netlib.org/lapack). The LAPACK
library was built using the reference BLAS implementation included in
the LAPACK source, with the same compilation options as used for the
NEC2 code. Modifications were made to four Fortran source files in the
LAPACK timing routines to allow compilation with GNU Fortran (which
does not accept calls to intrinsic functions in PARAMETER statements).

ASCI Red NEC2

The Linux ASCI Red PentiumPro Optimised BLAS libraries, version 1.1n,
by Greg Henry (http://www.cs.utk.edu/~ghenry/distrib), from the Intel
Performance Library Suite
(http://developer.intel.com/design/perftoll/perflibst) and other
sources, were used with LAPACK to build optimised versions of NEC2.

Benchmarks

The TEST300.NEC, TEST600.NEC and TEST1200.NEC files used in the PC
NEC4.1 Performance Data benchmark survey posted to the NEC mailing
list were used. These files have 300, 600 and 1200 segments
respectively. A new version of the test file with 2000 segments
(TEST2000.NEC) was also used.

-----------------
Table 1: Details of benchmark results.

                         Fill Time   Factor Time    Run Time
                               (s)           (s)         (s)

TEST300.NEC
  NEC4.1 DVF/NT              1.090         0.830       2.040
  Standard NEC2, egcs        0.960         0.940       1.990
  LAPACK NEC2, egcs          0.960         0.600       1.650
  ASCI Red NEC2, egcs        0.970         0.540       1.580
  Standard NEC2, PGI         0.800         1.050       1.920
  LAPACK NEC2, PGI           0.810         0.720       1.600
  ASCI Red NEC2, PGI         0.810         0.310       1.190

TEST600.NEC
  NEC4.1 DVF/NT              4.130         8.670      13.390
  Standard NEC2, egcs        3.500         8.440      12.150
  LAPACK NEC2, egcs          3.490         6.240       9.950
  ASCI Red NEC2, egcs        3.470         4.680       8.360
  Standard NEC2, PGI         2.900         9.330      12.420
  LAPACK NEC2, PGI           2.960         7.230      10.380
  ASCI Red NEC2, PGI         2.930         2.240       5.360

TEST1200.NEC
  NEC4.1 DVF/NT             16.560        71.540      89.840
  Standard NEC2, egcs       11.010        85.390      97.030
  LAPACK NEC2, egcs         11.010        67.110      78.740
  ASCI Red NEC2, egcs       11.030        37.810      49.470
  Standard NEC2, PGI         9.340        94.310     104.250
  LAPACK NEC2, PGI           9.480        79.760      89.790
  ASCI Red NEC2, PGI         9.310        16.950      26.740

TEST2000.NEC
  NEC4.1 DVF/NT
  Standard NEC2, egcs       26.290       459.030     486.880
  LAPACK NEC2, egcs         26.240       317.820     345.520
  ASCI Red NEC2, egcs       26.320       175.590     203.370
  Standard NEC2, PGI        22.620       516.090     540.230
  LAPACK NEC2, PGI          23.040       399.310     423.690
  ASCI Red NEC2, PGI        22.440        80.110     103.670
-----------------

Results

Table 1 presents the details of the fill time, factor time and total
run time for each benchmark using the different versions of NEC2. The
results are compared with the corresponding result from the PC NEC4.1
Performance Data benchmark survey for a 400MHz Pentium II using
Digital Visual Fortran under Windows NT. The results are shown
graphically in Figures 1 to 4.

For standard NEC2 the PGI compiled version is slightly faster than
egcs for the matrix fill but somewhat slower at the LU decomposition.
Both egcs and PGI are typically 10-15% slower at the factorisation
than the NT version of NEC4.1.

The variation of the fill time, factor time and total run time with
the number of segments is shown in Figures 5 to 7.
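
Since the graphs are not reproduced in the plain text, the scaling
with the number of segments can be estimated directly from the Table 1
timings. The Python sketch below is only an illustration and was not
part of the benchmark work: it assumes each time follows a power law,
t proportional to N**p, and estimates p between consecutive test
files. The names SEGMENTS, TIMINGS and local_exponents are invented
for this example; the numbers are copied from Table 1.

```python
import math

# Segment counts of TEST300, TEST600, TEST1200 and TEST2000.
SEGMENTS = [300, 600, 1200, 2000]

# Times in seconds, copied from Table 1 (three of the rows only).
TIMINGS = {
    "fill,   Standard NEC2, egcs": [0.960, 3.500, 11.010, 26.290],
    "factor, Standard NEC2, egcs": [0.940, 8.440, 85.390, 459.030],
    "factor, ASCI Red NEC2, PGI": [0.310, 2.240, 16.950, 80.110],
}


def local_exponents(sizes, times):
    """Estimate p for each consecutive pair, assuming t ~ N**p.

    For two points (n1, t1) and (n2, t2) on a power law,
    p = log(t2 / t1) / log(n2 / n1).
    """
    points = list(zip(sizes, times))
    return [
        math.log(t2 / t1) / math.log(n2 / n1)
        for (n1, t1), (n2, t2) in zip(points, points[1:])
    ]


if __name__ == "__main__":
    for label, times in TIMINGS.items():
        exponents = local_exponents(SEGMENTS, times)
        print(label + ": " + ", ".join(f"{p:.2f}" for p in exponents))
```

A dense N x N interaction matrix has N**2 entries to fill, and LU
factorisation costs on the order of N**3 operations, so exponents near
2 for the fill time and near 3 for the factor time are what one would
expect. Where the factor-time exponent comes out above 3, cache
effects are one possible explanation, but these timings alone cannot
confirm the cause.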

Table 2 shows the speed-up in the factorisation time,

    speed-up = reference factor time / factor time,

obtained by using the LAPACK libraries with both the reference BLAS
implementation and the ASCI Red BLAS for the two compilers. The
reference is the factor time of standard NEC2 built with the same
compiler. For egcs there is a speed-up of 1.3 to 1.6 using the
reference BLAS LAPACK and 1.7 to 2.6 using the ASCI Red BLAS. With the
ASCI Red BLAS the performance boost increases with the number of
segments. For the PGI compiler the speed-up with the reference BLAS
version is 1.2 to 1.5, slightly lower than with egcs. However, using
the ASCI Red BLAS the PGI compiler gives an improvement of 3.4 to 6.4,
much greater than with egcs.

-----------------
Table 2: Speed-up factors for LAPACK and ASCI Red BLAS versions of
NEC2 relative to the standard version.

                      egcs                   PGI
                LAPACK   ASCI Red      LAPACK   ASCI Red

TEST300.NEC       1.57       1.74        1.45       3.39
TEST600.NEC       1.35       1.80        1.29       4.17
TEST1200.NEC      1.27       2.26        1.18       5.56
TEST2000.NEC      1.44       2.61        1.29       6.44
-----------------

The far better performance of PGI with the ASCI Red BLAS may be due to
the known stack alignment problems with the GNU compilers. Serious
performance degradation can result if double precision variables are
not aligned on 64-bit boundaries in memory on 686 architectures: even
with -malign-double, egcs does not always do a good job of this.

-----
Received on Fri Apr 09 1999 - 16:02:40 EDT