Sun Studio 10 Compilers - EARLY ACCESS FAQ

Last updated: 4 November 2004

Resources:

Note that HTML documentation (man pages and readme files) installed with the Sun Studio 10 Early Access bits can be found at file:/opt/SUNWspro/docs (if the product is installed in /opt). Additional information has also been provided:

Combined readme for Sun Studio 10 Early Access compilers and tools gives latest information on new and changed features for all the Sun Studio components: readme
See also the release notes for this Early Access release.
Documentation for the current release, Sun Studio 9: docs
Additional developer resources: Sun Developer Portal

FAQ:

General Topics
Getting Started on Migration to 64-bit x86 platforms
AMD64 ABI Questions
Performance Questions
- What kind of performance improvement will I see from Sun Studio 10?
- I heard there was stunning improvement on the STREAM benchmark. How much was it and how did you get it?
Debugging on 64-bit Solaris 10 x86 platforms
Tuning 64-bit x86 applications
Porting from 64-bit SPARC V9 to AMD64
- Why does passing an int where a long was expected work on SPARC V9 but not AMD64?

General Topics

What is in Sun Studio 10 and why should I care about it?

Sun Studio 10 provides several major new features:

64-bit compiler support for the Solaris x86 platform
improved feature parity between Solaris OS SPARC and x86 platforms
improved code generation for AMD Opteron processors for both 32-bit and 64-bit architectures
support for C++ template template parameters
numerous performance improvements and bug fixes

The x86 Solaris 10 platform not only provides 64-bit addressing, but also provides improved performance for many applications that would otherwise work well in 32-bit mode.

Sun Studio compiler features available on Solaris OS SPARC platforms are now available on Solaris OS x86 platforms, in both 32-bit and 64-bit modes. These include:

OpenMP 2.0 in C, C++ and Fortran (-xopenmp)
automatic parallelization (-xautopar)
thread-local storage in C and C++ (__thread)
interprocedural optimization (-xipo)
profile-feedback optimization (-xprofile)
loop dependence analysis and transforms (-xdepend)
loop vectorization (-xvector)
restricted parameters (-xrestrict)
type-based alias analysis (-xalias_level)
memory prefetching (-xprefetch)
Sun Performance library (-xlic_lib=sunperf, -library=sunperf)
non-standard floating-point computation (-fns)

Is the entire toolset being offered on 64-bit x86 Solaris platforms?

Yes. The C, C++ and Fortran compilers offer the -xarch=amd64 mode for compiling for the AMD64 platform. Additionally, dbx and the performance tools can analyze 64-bit binaries. The math and performance libraries are specifically tuned for the AMD64 architecture. The assembler and disassembler tools have also been extended to understand new instructions and exploit new hardware. The rest of the toolset remains largely unchanged.

How do I know if the platform I'm using is running the 64-bit kernel?
At a Solaris 10 shell prompt, run the command isainfo. You should see:

amd64 i386

You will see amd64 only if you are running the 64-bit kernel.

Getting Started on Migration to 64-bit x86 Platforms

I don't even know where to start. Can you help?

Here are some quick references to get ready/motivated:

Extending x86 to 64 bits: http://www.devx.com/amd/Article/16101

What's So Great for Developers About the AMD64?: http://www.devx.com/amd/Article/16018

Also, see the Solaris 10 64-bit Developer's Guide

What is the best way to port to 64-bit x86?

If you are moving an application to Solaris 10 for x86 platforms for the first time (in 32-bit mode), use Sun Studio 9 to compile and develop on Solaris 8, 9, or 10 platforms. Do the final build with the Sun Studio 10 compilers. You might see a substantial performance improvement with the Sun Studio 10 compilers, for both 32-bit and 64-bit code.

If you are moving from 64-bit SPARC V9, it's a straight recompile (with some of the caveats listed here).

Read Chapter 8 Converting Applications for a 64-Bit Environment in the Sun Studio 9 C User's Guide. This will be updated for 64-bit x86 with the final release of Sun Studio 10.

Also, see the Solaris 10 64-bit Developer's Guide

Will I need to recompile my SPARC code for the new 64-bit Solaris x86 systems?

Yes. The AMD Opteron 64-bit instruction set is very different from the SPARC instruction set.

What options do I use to compile 64-bit on my Opteron?

Use -xarch=amd64. You can also use -xarch=generic64, which is also available on SPARC, so you can use the same option in your makefiles when compiling code for 64-bit x86 and 64-bit SPARC V9 processors.

For the latest Sun Studio 10 compiler option information, see the combined readme

Will I need to recompile my 32-bit Linux application for 64-bit Solaris x86 platforms?

Janus will enable Linux applications to run on Solaris platforms, unchanged.

Will I be able to link Linux and 32-bit Solaris x86 code together?

No.

What does -fast expand to when compiling on x86 platforms compared with SPARC platforms?

The -fast option is a macro that serves as a starting point for tuning an executable for maximum runtime performance. Its expansion can change from one release of the compiler to the next and includes options that are specific to the target platform. Compile with the -# option or -xdryrun to examine the expansion of -fast, and incorporate the appropriate options into the ongoing process of tuning the executable.

Note that to compile a 64-bit x86 object with -fast you need to follow the -fast option with -xarch=amd64 on the command line. (Why? See the next item.)

	x86	SPARC
cc	`-D__MATHERR_ERRNO_DONTCARE -dalign -fns -nofstore -fsimple=2 -fsingle -xarch=sse2 -xbuiltin=%all -xcache=64/64/2:1024/64/8 -xchip=opteron -xlibmil -xlibmopt -xO5`	`-D__MATHERR_ERRNO_DONTCARE -fns -fsimple=2 -fsingle -xalias_level=basic -xarch=v8plusa -xbuiltin=%all -xcache=16/32/1:4096/64/1 -xchip=ultra2 -xdepend -xlibmil -xlibmopt -xmemalign=8s -xO5 -xprefetch=auto,explicit`
CC	`-xO5 -xarch=sse2 -xcache=64/64/2:1024/64/8 -xchip=opteron -fsimple=2 -fns=yes -ftrap=%none -xlibmil -xlibmopt -xbuiltin=%all -nofstore`	`-xO5 -xarch=v8plusa -xcache=16/32/1:4096/64/1 -xchip=ultra2 -xmemalign=8s -fsimple=2 -fns=yes -ftrap=%none -xlibmil -xlibmopt -xbuiltin=%all`
f95	`-xO5 -xarch=sse2 -xcache=64/64/2:1024/64/8 -xchip=opteron -dalign -fsimple=2 -fns=yes -ftrap=common -xlibmil -xlibmopt -nofstore`	`-xO5 -xarch=v8plusa -xcache=16/32/1:4096/64/1 -xchip=ultra2 -xdepend=yes -xpad=local -xvector=yes -xprefetch=auto,explicit -dalign -fsimple=2 -fns=yes -ftrap=common -xlibmil -xlibmopt -fround=nearest`

Why must -xarch=amd64 also be specified following -fast?

Compiling with -fast on a 64-bit x86 (AMD64) platform is not sufficient to generate 64-bit code. You must also specify -xarch=amd64. Here's why:

The -xarch option is evaluated from left to right on the command line, so the last specification of -xarch appearing on the command line determines which value of -xarch will be used.

-fast is a macro option whose expansion includes -xtarget=native. However, even on an AMD64 platform, -xtarget=native will expand to -xarch=sse2, which is a 32-bit architecture. You also need to explicitly follow -fast on the command line with -xarch=amd64 to signal 64-bit code generation.

Be aware that the order of these two options is important. Specifying -xarch=amd64 -fast would expand to -xarch=amd64 -xarch=sse2 which still would result in 32-bit code generation. Specifying -fast -xarch=amd64 would expand to -xarch=sse2 -xarch=amd64, which would correctly signal 64-bit code generation.
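The ordering rule above can be modelled in a few lines of C. This is only a sketch of the "last -xarch wins" behaviour described in this FAQ, not the compiler driver's actual logic. The function name effective_xarch is hypothetical, and -fast is treated simply as if it contributed -xarch=sse2, as the x86 column of the table above shows.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model: scan the options left to right and return the value
   of the last -xarch seen. -fast is treated as if it expanded to
   -xarch=sse2, as described above for x86.
   Returns NULL if no option sets -xarch. */
const char *effective_xarch(const char *const opts[], size_t n)
{
    static const char prefix[] = "-xarch=";
    const char *arch = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(opts[i], "-fast") == 0)
            arch = "sse2";                          /* macro expansion */
        else if (strncmp(opts[i], prefix, sizeof prefix - 1) == 0)
            arch = opts[i] + (sizeof prefix - 1);   /* explicit setting */
    }
    return arch;
}
```

Under this model, the option list { "-xarch=amd64", "-fast" } yields sse2, while { "-fast", "-xarch=amd64" } yields amd64.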

AMD64 ABI Questions

Where can I find the AMD64 ABI?

http://www.x86-64.org/documentation (currently at version 0.92)

What, in short, is unique to this ABI that I should care about?

To summarize:

There are 8 new integer registers and 8 new FP (XMM) registers, giving 16 of each.
Integer and XMM registers participate in the parameter passing conventions, making it faster to call routines than on 32-bit x86, where all parameters were passed via the memory stack.
Efficient (and different) alignment of data types (long, pointer, uint64)
Small structs are optimized to pass and return in registers
varargs are defined differently (user code implication only if the code does type-spoofing)
There are four code models: small, kernel, medium and large

Frame pointers can be optimised away, so they are optional. There is a separate eh_frame mechanism to deal with stack unwinding.
Finally, a note: at the compiler level, the ABI will be common between Solaris OS and Linux, thereby allowing a greater level of interoperability than in the past.

Will I need to recompile my 64-bit Linux code for 64-bit Solaris 10 x86 platforms?

Our goal is binary compatibility between Linux and Solaris for the 64-bit AMD Opteron instruction set over a useful range of programs.

We are not yet at our goal, but we are working closely with AMD, Linux, and Solaris developers to produce a common Application Binary Interface (ABI). This document will likely result in changes to Linux, so you may need to upgrade to a newer version of Linux to get binary compatibility.

Note, however, that ABI compatibility has limitations when files appear in different places within the file system. Furthermore, Solaris is POSIX compliant and Linux is not. So, binary compatibility will only be effective if programmers code to the common subset of Linux and Solaris.

Will I be able to link Linux and 64-bit Solaris x86 code together?

Yes, but with the caveats of the previous question.

What are the C data type differences between 32-bit (ILP32) and 64-bit (LP64) x86?

See the table below:

Size and alignment of C types for AMD64 Architecture

C Type	ILP32 sizeof (bytes)	ILP32 alignment (bytes)	LP64 sizeof (bytes)	LP64 alignment (bytes)
Integral
_Bool	1	1	1	1
char, signed char	1	1	1	1
unsigned char	1	1	1	1
short, signed short	2	2	2	2
unsigned short	2	2	2	2
int, signed int, enum	4	4	4	4
unsigned int	4	4	4	4
long, signed long	4	4	8	8
unsigned long	4	4	8	8
long long, signed long long	8	4	8	8
unsigned long long	8	4	8	8
Pointer
any-type *, any-type (*) ()	4	4	8	8
Floating Point
float	4	4	4	4
double	8	4	8	8
long double	12	4	16	16
Complex Types
float _Complex	8	4	8	4
double _Complex	16	4	16	8
long double _Complex	24	4	32	16
Imaginary Types
float _Imaginary	4	4	4	4
double _Imaginary	8	4	8	8
long double _Imaginary	12	4	16	16

For more information, including data type sizes and alignment on SPARC platforms, see Appendix F of the Sun Studio 9 C Compiler User's Guide.
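If you want to check the table against your own build, a small program can print the sizes and alignments the compiler actually uses. The following is a minimal sketch, not part of the Sun Studio distribution. The names ALIGN_OF, SHOW, is_lp64 and show_layout are made up for this illustration, and the alignment is derived with the common offsetof idiom rather than a compiler-specific keyword.

```c
#include <stddef.h>
#include <stdio.h>

/* Alignment of type T, derived from the offset of a T member that follows
   a single char in a struct. */
#define ALIGN_OF(T) offsetof(struct { char c; T member; }, member)

/* Print one row: type name, sizeof, alignment. */
#define SHOW(T) printf("%-12s %2lu %2lu\n", #T, \
                       (unsigned long)sizeof(T), (unsigned long)ALIGN_OF(T))

/* Returns 1 when built as LP64 (8-byte long and pointer), 0 otherwise. */
int is_lp64(void)
{
    return sizeof(long) == 8 && sizeof(void *) == 8;
}

/* Print sizes and alignments for a few of the types in the table. */
void show_layout(void)
{
    SHOW(int);
    SHOW(long);
    SHOW(long long);
    SHOW(void *);
    SHOW(double);
    SHOW(long double);
}
```

Call show_layout() from main and build the program twice, once as a 32-bit binary and once with -xarch=amd64, then compare the two listings with the ILP32 and LP64 columns.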

Will my binary data files be the same between SPARC and 64-bit x86?

If they are, you got lucky. Look at the 64-bit x86 data types table above and compare that with 64-bit SPARC data types.
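Besides size and alignment, byte order differs: SPARC stores multi-byte values big-endian, while x86 stores them little-endian. One common way to keep data files portable is to fix the byte order in the file format and convert on every read and write. The sketch below assumes a big-endian file format; the helper names put_u32_be and get_u32_be are hypothetical.

```c
#include <stdint.h>

/* Store a 32-bit value in big-endian order, regardless of host byte order. */
void put_u32_be(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Read back a 32-bit value stored in big-endian order. */
uint32_t get_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Because the helpers work on individual bytes with shifts, the same source produces the same file contents on SPARC and on x86.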

Will my binary data files be the same between 32-bit and 64-bit x86?

If your data files consist of an array of one integer type, except long and unsigned long, then the answer is that they probably will be the same. If your data files contain floating point types, structures or unions, they probably won't be the same.
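To illustrate why structures are a problem, compare a record declared with native types to one declared with fixed-width types from <stdint.h>. This is a sketch only; the struct names are invented for the example, and the fixed layout relies on the explicit padding field shown.

```c
#include <stdint.h>

/* Layout depends on the data model: long is 4 bytes under ILP32 and
   8 bytes under LP64, so the size of the struct changes as well. */
struct record_native {
    int  id;
    long value;
};

/* Fixed-width fields with explicit padding: intended to have the same
   16-byte layout in both 32-bit and 64-bit x86 builds. */
struct record_fixed {
    int32_t id;
    int32_t reserved;   /* explicit padding, written as zero */
    int64_t value;
};
```

A file written as an array of struct record_native by a 32-bit build will not read back correctly in a 64-bit build, whereas struct record_fixed is meant to keep the same field offsets in both.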

Will I need to recompile my 32-bit x86 code to run it on 64-bit Solaris 10 x86 platforms?

No. 64-bit Solaris 10 x86 OS will run existing 32-bit Solaris x86 binaries without change.

While recompiling is not necessary, many customers will experience a boost in performance when re-compiling to 64-bit x86 code.

So why will I get a performance boost when recompiling for 64-bit x86?

There are several reasons, mostly from performance techniques that the industry has developed after the 32-bit ABI was frozen.

The AMD64 architecture has twice as many registers as 32-bit x86: 16 general registers versus 8, and 16 XMM registers versus 8. The ability of the compiler to keep data in the fastest available location is much improved.

The AMD64 ABI requires types to be aligned on their size, which enables fast loads and stores.

Rather than passing parameters in memory on the stack, the AMD64 ABI passes integer and pointer parameters in general registers and floating-point parameters in XMM registers.

The AMD64 ABI passes and returns small structures in registers. This feature will mostly benefit C++ codes.

Is it possible I would lose performance recompiling my application from 32-bit to 64-bit x86?

Yes. The potential loss of performance comes from three things: heavy use of pointers, heavy use of varargs, and heavy use of stack walkback. It is generally hard to predict whether a specific application will gain or lose performance. Your best bet is to measure the performance of both the 32-bit and the 64-bit x86 builds, and then choose the better one.

What is the problem with pointers?

Pointers are larger. If your application data is mostly pointers to other data, and you spend most of your execution time waiting on main memory, the increased size of pointers decreases the number of pointers that fit in the cache, and will more likely saturate the bandwidth to memory, thus reducing performance.
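If measurement shows that pointer size is hurting a pointer-heavy data structure, one possible mitigation is to link nodes by 32-bit array index instead of by pointer. The sketch below is an illustration of that idea, not a recommendation from the compiler team. All names are hypothetical, and the approach assumes the nodes live in a single array of fewer than 2^32 elements.

```c
#include <stdint.h>

/* Pointer-linked node: the link is 4 bytes under ILP32 and 8 under LP64
   (and the struct is typically padded to 16 bytes under LP64). */
struct node_ptr {
    struct node_ptr *next;
    int32_t value;
};

/* Index-linked node: the link is a 32-bit index into a node array, so the
   node is 8 bytes in both data models. */
struct node_idx {
    uint32_t next;      /* index of the next node; NODE_NONE ends the list */
    int32_t  value;
};

#define NODE_NONE UINT32_MAX

/* Sum the values of a list stored in pool, starting at index head. */
int64_t sum_list(const struct node_idx *pool, uint32_t head)
{
    int64_t total = 0;
    uint32_t i;

    for (i = head; i != NODE_NONE; i = pool[i].next)
        total += pool[i].value;
    return total;
}
```

The index-linked form keeps the same cache footprint in 32-bit and 64-bit builds, at the cost of an extra base-plus-index computation on each step.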

What is the problem with varargs?

Varargs processing is relatively slow on 64-bit x86 because arguments are really packed into registers and one needs to track a fair amount of information to get the next parameter from the proper place. Normal non-varargs functions should be faster because of this approach, but the varargs functions themselves will be slower. There's not much you can do about it, so don't worry about it.

What is the problem with stack walkback?

The calling convention needs a lot of information about each function to walk back up the stack. Much of this information is stored in the executable as auxiliary information, separate from the actual code. The result is that object files are much larger, often as much as twice as large as they would be on 32-bit x86. Pulling together all the information necessary to walk back up the stack means that C++ exception processing, Java exception processing, POSIX thread cancellation, etc., will be relatively slow.

What is this I hear about the frame pointer?

The AMD64 ABI permits the compiler to reuse the register that normally contains the frame pointer. The reason is that one extra register can sometimes make a significant difference in the speed of loops. Unfortunately, without the frame pointer available and in a consistent location, debugging and performance analysis tools cannot easily follow the chain of function calls. In particular, when the compiler reuses the frame pointer register, dtrace will not work. Dtrace is a Solaris 10 OS facility for whole-system performance analysis. It can help you identify the big problems in system performance. Because this facility is so important, Sun Studio compilers will not reuse the frame pointer by default.

For some applications, particularly benchmarks, the higher-level performance problems that dtrace will help you find have already been eliminated. In these circumstances, reusing the frame pointer register will provide an extra boost of speed. To make this boost more easily available, we reuse the frame pointer register with the -fast option.

How will varargs be different on 64-bit x86? After all, isn't all that stuff invisible to the users?

The AMD64 ABI requires parameters to be passed in specific registers.

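So if you pass a double where printf expects an integer, as with a long long hex specifier such as %llX, it won't work: the double is passed in an XMM register, but printf fetches the %llX argument from an integer register. Example: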

#include <stdio.h>

/* Reinterpret the storage of d as an unsigned long long */
#define   L(d)   ((unsigned long long *) &(d))[0]

int main (void) {
    double dval = 132.674;

    /* This technique won't work on AMD64: the argument type
       does not match the %llX specifier */
    printf("dval = %5.2f (%llX)\n", dval, dval);

    /* This technique will work on ILP32 and LP64,
       SPARC or x86 */
    printf("dval = %5.2f (%llX)\n", dval, L(dval));

    return 0;
}

amd64% /set/vulcan/lang/intel-S2/bin/cc t.c -xarch=amd64
amd64% ./a.out
dval = 132.67 (FFFFFD7FFFFFF5B8)
dval = 132.67 (406095916872B021)
amd64%
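An alternative to the pointer cast in L() is to copy the bytes of the double into an integer with memcpy and pass that integer to printf. This is a sketch under the assumption that double is a 64-bit IEEE 754 value, as it is on SPARC and x86; the function name double_bits is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Return the bit pattern of a double as a 64-bit integer.
   memcpy copies the bytes without the pointer cast used by L(). */
uint64_t double_bits(double d)
{
    uint64_t bits;

    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

The second printf in the example could then be written as printf("dval = %5.2f (%llX)\n", dval, (unsigned long long) double_bits(dval)); so that the argument type matches the %llX specifier.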

Performance Questions

What kind of performance improvement will I see from Sun Studio 10 compilers?

It varies with the kind of application you have and the hotspots it presents to the compiler for optimization. With SPEC, we expect to see about a 10% improvement in SPEC INT and about 40% in SPEC FP. Taking advantage of AMD64 hardware, instructions, and memory model can yield a 7-20% improvement over 32-bit applications. With the improvements we have added to Sun Studio 10, you might see significant improvements above that threshold. To be fair, the compiler will be in a constant state of improvement up until the final release of Sun Studio 10, so exact numbers are hard to give. Here's some competitive information on SPEC:

Compilers	SPEC INT	SPEC FP
GCC/g77	1369	1001 (estimated)
Studio9	1160	1110
Studio10 EA	1301 (estimated)	1365 (estimated)

Notes:

"estimated" is the SPEC term for numbers that were measured internally and not published externally on the SPEC site.
Numbers are likely to be higher on official configuration.
Sun Studio 9 and Sun Studio 10 numbers at this time are 32-bit numbers. Sun Studio 10 EA estimates are expected to improve with the final release of the software.

I heard there was stunning improvement on the STREAM benchmark. How much was it and how did you get it?

We added microvectorization and prefetching to the code generator and it boosted performance by 2x. Here are some competitive numbers; they are roughly on the same kind of box.

STREAM Numbers	Copy	Scale	Add	Triad
GCC	2140	2318	2487	2197
Studio9	2031	2089	237	1913
Studio10/V65x	2586	2454	2495	2517
Studio10/v20z	4717	4635	4275	4349
Studio10/autopar	7905	7396	7169	7220

Debugging on 64-bit Solaris 10 x86 Platforms

dbx

dbx is changing rapidly. For the latest information, see the combined readme

Is there a special version of dbx for 64-bit x86 platforms?

As with Sun Studio compilers on SPARC platforms, we ship two dbx binaries, a 32-bit dbx that can debug 32-bit programs only, and a 64-bit dbx that can debug both 32-bit and 64-bit binaries. On an x86 Solaris system running a 64-bit kernel, the 64-bit dbx is the default.

What works today in dbx?

See the latest information in the combined readme

What can't the 64-bit dbx do yet?

Again, for the latest information, see the combined readme

How do I use the 32-bit dbx on a 64-bit x86 system?

dbx -x exec32 ....

See the combined readme

Tuning 64-bit x86 Applications

Are there tools for tuning 64-bit x86 applications?

The Sun Studio Performance Tools can help find bottlenecks in C, C++, Fortran, and Java applications. In many ways, these tools are more flexible and detailed than prof and gprof. They can help answer the following kinds of questions:

Which source lines and instructions are consuming the most resources?
How did the program arrive at this point in the execution?

For more information about the performance tools in Sun Studio, see the Developer Portal.

How do I use the Performance Tools?

First, record an application's run with Collect, then view and analyze the results with Analyzer. More details can be found at http://developers.sun.com/tools/cc/articles/perftools_tip.html

What are new features for Sun Studio 10?

See the combined readme

Are there limitations for 64-bit x86?

dbx does not yet support the Performance Tools `collector' option.
When profiling Java, call stack walking will be disabled when frame pointers are not supplied by the Java compiler.
The dataspace profiling option is not supported on x86.

See the combined readme

Do I need to compile my application differently?

In general, you don't need to recompile your application. However, the ability to show full call stacks depends on the use of frame pointers. For AMD64 processors, frame pointers are used in C++, but they are disabled for C at higher levels of optimization. You may ensure use of frame pointers by compiling your application with the following options:

C: -Wu,-Z~B
C++ and F95: -Qoption ube -Z~B

Porting from 64-bit SPARC V9 to AMD64

Programs that are already LP64 clean can, for the most part, just be compiled with -xarch=amd64 and should run. Makefiles with SPARC-specific compiler options may need to be adjusted.

Why does passing an int where a long was expected work on SPARC V9 but not AMD64?

Prototypes should match the function signature:

    With wrong prototype             With correct prototype
    --------------------             ----------------------
    void insert_stc(int);            void insert_stc(long);

    void string_append() {           void string_append() {
        insert_stc(-1);                  insert_stc(-1);
    }                                }

On SPARC V9 the call to insert_stc will appear to sign extend the argument from int to a long where the wrong prototype has been used. This allows the incorrect program to function as if a correct prototype was in scope. On AMD64 a 4-byte -1 will be passed as specified by the prototype, resulting in zero extension, and incorrect or undefined execution of the program.
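The difference between the two outcomes can be reproduced with explicit casts. The sketch below models what a callee expecting a 64-bit long would see in each case. It does not call through a mismatched prototype itself, and the function names are hypothetical.

```c
#include <stdint.h>

/* What a callee expecting a 64-bit long sees when the 32-bit argument has
   been sign-extended (the behaviour described for SPARC V9). */
int64_t seen_if_sign_extended(int32_t arg)
{
    return (int64_t)arg;
}

/* What it sees when the upper 32 bits are zero (the behaviour described
   for AMD64). */
int64_t seen_if_zero_extended(int32_t arg)
{
    return (int64_t)(uint32_t)arg;
}
```

For an argument of -1, the first function returns -1 and the second returns 4294967295, which is why code that tests the parameter against -1 behaves differently on the two platforms.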