Draft tutorial: Building programs

Introduction: compiled languages

Languages like Fortran, C, C++ and Java, to name but a few, share
certain characteristics: you write code in your language of choice but
then you have to build an executable program from that source code.
Other languages are interpreted - the source code is analysed by a
special program and taken as direct instructions. Two very simple
examples of that type of language: Windows batch files and Linux shell
scripts.

In this tutorial we concentrate on the first type of language, with
Fortran as the main example. One advantage of compiled languages is that
the build process, which you need anyway to get an executable program,
transforms the human-readable source code into an efficient program
that can be run on the computer.

Let us have a look at a simple example:

program hello
    write(*,*) 'Hello!'
end program hello

This is just about the simplest program you can write in Fortran and it
is certainly a variation on one of the most famous programs. Even though
it is simple to express in source code, a lot of things actually happen
when the executable that is built from this code runs:

* A process is started on the computer in such a way that it can write
to the console - the window (DOS-box, xterm, ...) at which you type the
program's name.
* It writes the text "Hello!" to the console. To do so it must properly
interact with the console.
* When done, it finishes, cleaning up all the resources (memory,
connection to the console etc.) it took.

Fortunately, as a programmer in a high-level language you do not need to
consider all these details. In fact, these are the sort of things that
are taken care of by the build process: the compiler and the linker.


Compiling the source code

The first step in the build process is to compile the source code. The
output from this step is generally known as the object code - a set of
instructions for the computer generated from the human-readable source
code. Different compilers will produce different object code from the
same source code and their naming conventions differ as well.

The consequences:

* If you use a particular compiler for one source file, you need to use
the same compiler (or a compatible one) for all other pieces. After
all, a program may be built from many different source files and the
compiled pieces have to cooperate.
* Each source file will be compiled and the result is stored in a file
with an extension like ".o" or ".obj". It is these object files that are
the input for the next step: the link process.

Compilers are complex pieces of software: they have to understand the
language in much more detail and depth than the average programmer. They
also need to understand the inner working of the computer. And then,
over the years they have been extended with numerous options to
customise the compilation process and the final program that will be
built.

But the basics are simple enough. Take the gfortran compiler, part of
the GNU compiler collection. To compile a simple program such as the one
above, which consists of one source file, you run the following command:

$ gfortran -c hello.f90

(assuming the source code is stored in the file "hello.f90")

This results in a file "hello.o" (as the gfortran compiler uses ".o" as
the extension for the object files).

The option "-c" means: only compile the source files. If you were to
leave it out, then the default action of the compiler is to compile the
source file and start the linker to build the actual executable program.
The command:

$ gfortran hello.f90

results in an executable file, "a.out" (on Linux) or "a.exe" (on
Windows).

Some remarks:

* The compiler may complain about the contents of the source file if it
finds something wrong with it - a typo for instance or an unknown
keyword. In that case the compilation process is broken off and you will
not get an object file or an executable program. For instance, if
the word "program" was inadvertently typed as "prgoram":

$ gfortran hello3.f90
hello3.f90:1:0:

    1 | prgoram hello
      |
Error: Unclassifiable statement at (1)
hello3.f90:3:17:

    3 | end program hello
      |                 1
Error: Syntax error in END PROGRAM statement at (1)
f951: Error: Unexpected end of file in ‘hello3.f90’

Using this compilation report you can correct the source code and try
again.

* The step without "-c" can only succeed if the source file contains a
main program - characterised by the "program" statement in Fortran.
Otherwise the link step will complain about a missing "symbol":

$ gfortran hello2.f90
/usr/lib/gcc/x86_64-pc-cygwin/9.3.0/../../../../x86_64-pc-cygwin/bin/ld: /usr/lib/gcc/x86_64-pc-cygwin/9.3.0/../../../../lib/libcygwin.a(libcmain.o): in function `main':
/usr/src/debug/cygwin-3.1.4-1/winsup/cygwin/lib/libcmain.c:37: undefined reference to `WinMain'
/usr/src/debug/cygwin-3.1.4-1/winsup/cygwin/lib/libcmain.c:37:(.text.startup+0x7f): relocation truncated to fit: R_X86_64_PC32 against undefined symbol `WinMain'
collect2: error: ld returned 1 exit status

The file "hello2.f90" is almost the same as the file "hello.f90", except
that the keyword "program" has been replaced by "subroutine".
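For reference, here is a sketch of what "hello2.f90" would then look like. It
is reconstructed from the description above, not copied from the actual file:

```fortran
! hello2.f90 - as described in the text: "program" replaced by "subroutine"
! There is no main program in this file, so it cannot be linked on its own.
subroutine hello
    write(*,*) 'Hello!'
end subroutine hello
```

Compiling this file with the option "-c" works, as a subroutine can be
compiled on its own. It is only the link step that fails.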

The exact output shown in the above examples differs per compiler and
per platform on which it runs. These examples come from the gfortran
compiler running in a Cygwin environment on Windows.

Compilers also differ in the options they support, but in general they
offer:

* Options for optimising the code - resulting in faster programs or
smaller memory footprints.
* Options for checking the source code - for instance, checks that a
variable is not used before it has been given a value or checks whether
some extension to the language is used.
* Options for the location of include or module files, see below.
* Options for debugging.
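
As an illustration of a checking option, here is a minimal sketch of a
program that uses a widespread language extension. The file and program
names are made up for this example, and the options in the comments are
the gfortran spellings; other compilers use different names.

```fortran
! nonstandard.f90 - hypothetical file name, for illustration only
!
! By default gfortran accepts this program:
!     gfortran nonstandard.f90
! With a request to conform to a particular standard, for instance
!     gfortran -std=f2008 nonstandard.f90
! the compiler should report the declaration below as non-standard.
! Options from the other categories are, for instance, "-O2" (optimise),
! "-I" followed by a directory (location of include and module files)
! and "-g" (debugging information).
program nonstandard
    implicit none
    real*8 :: x      ! "real*8" is a common extension, not standard Fortran

    x = 1.0d0
    write(*,*) 'x = ', x
end program nonstandard
```

The standard-conforming way to write such a declaration would be via a
kind parameter, for instance "real(kind=kind(1.0d0)) :: x".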


Linking the pieces

Almost all programs, except for the simplest, are built up from
different pieces. We are going to examine such a situation in
more detail.

Here is a general program for tabulating a function (source code in
"tabulate.f90"):

program tabulate
    use function

    implicit none
    real    :: x, xbegin, xend
    integer :: i, steps

    write(*,*) 'Please enter the range (begin, end) and the number of steps:'
    read(*,*)  xbegin, xend, steps

    do i = 0,steps
        x = xbegin + i * (xend - xbegin) / steps
        write(*,'(2f10.4)') x, f(x)
    enddo
end program tabulate

Note the use statement - it refers to the module "function", which is
where we define the function f.

We want to keep the program general, so we separate the specific
source code - the implementation of the function f - from the general
source code. There are several ways to achieve this, but one is to put
it in a different source file. We can then give the general program to
users, who provide their own specific source code.

Assume for the sake of the example that the function is implemented in a
source file "function.f90" as:

module function
    implicit none
contains

real function f( x )
    real, intent(in) :: x

    f = x - x**2 + sin(x)

end function f
end module function

To build the program with this specific function, we need to compile two
source files and combine them via the link step into one executable
program. Because the program "tabulate" depends on the module
"function", we need to compile the source file containing our module
first. A sequence of commands to do this is:

$ gfortran -c function.f90
$ gfortran tabulate.f90 function.o

The first step compiles the module, resulting in an object file
"function.o" and a module intermediate file, "function.mod". This module
file contains all the information the compiler needs to determine that
the function f is defined in this module and what its interface is. This
information is crucial: it enables the compiler to check that you call
the function in the right way. It might be that you made a mistake and
called the function with two arguments instead of one. If the compiler
does not know anything about the function's interface, then it cannot
check anything.
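
The effect of the module file can be tried out with a small test program.
This is only a sketch: the file and program names are made up for the
occasion, and it relies on the module "function" shown earlier.

```fortran
! check_interface.f90 - hypothetical file name, for illustration only
!
! Build after the module has been compiled, so that "function.mod" exists:
!     gfortran -c function.f90
!     gfortran check_interface.f90 function.o
program check_interface
    use function

    implicit none

    ! Correct call: one real argument, as recorded in the module file
    write(*,*) f(1.0)

    ! Incorrect call: two arguments. If you uncomment the next line, the
    ! compiler should reject it, because it knows the interface of f.
    ! write(*,*) f(1.0, 2.0)
end program check_interface
```

As written, the program compiles and prints the value of the function
at x = 1.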

The second step invokes the compiler in such a way that:

* it compiles the file "tabulate.f90" (using the module file);
* it combines the object code for "tabulate.f90" and the object file
"function.o" into an executable program - with the default name "a.out"
or "a.exe" (if you want a different name, use the option "-o").
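
The same two steps work for any other implementation of the function, as
long as the module name and the interface of f stay the same. A minimal
sketch of such an alternative follows. The file name, the name of the
executable and the formula are arbitrary choices made for this
illustration.

```fortran
! function_alt.f90 - hypothetical file name; the formula is just an example
!
! Possible build commands, using "-o" to name the executable:
!     gfortran -c function_alt.f90
!     gfortran -o tabulate_alt tabulate.f90 function_alt.o
module function
    implicit none
contains

real function f( x )
    real, intent(in) :: x

    ! A damped oscillation instead of the original expression
    f = exp(-x) * cos(x)

end function f
end module function
```

The source file "tabulate.f90" does not need to change at all. Note that
compiling this file writes a new module file "function.mod", replacing
the one from the original implementation.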

What you do not see in general is that the linker also adds a number of
extra files in this link step, the run-time libraries. These run-time
libraries contain all the "standard" stuff - low-level routines that do
the input and output to screen, the sine function and much more.

If you want to see the gory details, add the option "-v". This instructs
the compiler to report in detail all the steps that it takes.

The end result, the executable program, contains the compiled source
code and various auxiliary routines that make it work. It also contains
references to so-called dynamic run-time libraries (in Windows: DLLs, in
Linux: shared objects or shared libraries). Without these run-time
libraries the program will not start.


Run-time libraries

To illustrate that even a simple program depends on external run-time
libraries, here is the output from the "ldd" utility that reports such
dependencies:

$ ldd tabulate.exe
        ntdll.dll => /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll (0x7ff88f2b0000)
        KERNEL32.DLL => /cygdrive/c/WINDOWS/System32/KERNEL32.DLL (0x7ff88e450000)
        KERNELBASE.dll => /cygdrive/c/WINDOWS/System32/KERNELBASE.dll (0x7ff88b9e0000)
        cygwin1.dll => /usr/bin/cygwin1.dll (0x180040000)
        cyggfortran-5.dll => /usr/bin/cyggfortran-5.dll (0x3efd20000)
        cygquadmath-0.dll => /usr/bin/cygquadmath-0.dll (0x3ee0b0000)
        cyggcc_s-seh-1.dll => /usr/bin/cyggcc_s-seh-1.dll (0x3f7000000)

... To continue ...


Include files and modules

Managing libraries (static and dynamic libraries)

Build tools