dl.tex

%%% vim:ai:tw=72:
\chapter{Program Loading and Dynamic Linking}

\section{Program Loading}

Program loading is a process of mapping file segments to virtual
memory segments. For efficient mapping executable and shared object
files must have segments whose file offsets and virtual addresses are
congruent modulo the page size.

To save space the file page holding the last page of the text segment
may also contain the first page of the data segment. The last data
page may contain file information not relevant to the running process.
Logically, the system enforces the memory permissions as if each
segment were complete and separate; segments' addresses are adjusted
to ensure each logical page in the address space has a single set of
permissions. In the example above, the region of the file holding the
end of text and the beginning of data will be mapped twice: at one
virtual address for text and at a different virtual address for data.
 
The end of the data segment requires special handling for
uninitialized data, which the system defines to begin with zero
values. Thus if a file's last data page includes information not in
the logical memory page, the extraneous data must be set to zero, not
the unknown contents of the executable file. ``Impurities'' in the
other three pages are not logically part of the process image; whether
the system expunges them is unspecified.

One aspect of segment loading differs between executable files and
shared objects. Executable file segments typically contain absolute
code (see section~\ref{sec_coding_examples} ``Coding Examples'').
For the process to
execute correctly, the segments must reside at the virtual addresses
used to build the executable file. Thus the system uses the {\tt p_vaddr}
values unchanged as virtual addresses.

On the other hand, shared object segments typically contain
position-independent code. This lets a segments virtual address
change from one process to another, without invalidating execution
behavior. Though the system chooses virtual addresses for individual
processes, it maintains the segments' relative positions. Because
position-independent code uses relative addressing between segments,
the difference between virtual addresses in memory must match the
difference between virtual addresses in the file.

%%SUN The following table
%%shows possible shared object virtual address assignments for several
%%processes, illustrating constant rela­tive positioning. The table also
%%illustrates the base address computations.

\subsection{Program header}

The following \xARCH program header types are defined:

\begin{table}[H]
\Hrule
  \caption{Program Header Types}
  \begin{center}
    \begin{tabular}[t]{l|l}
      \multicolumn{1}{c}{Name} & \multicolumn{1}{c}{Value} \\
      \hline
      \texttt{PT_GNU_EH_FRAME} & \texttt{0x6474e550} \\
      \texttt{PT_SUNW_EH_FRAME} & \texttt{0x6474e550} \\
      \texttt{PT_SUNW_UNWIND} & \texttt{0x6464e550}
    \end{tabular}
  \end{center}
\Hrule
\end{table}

\begin{description}
 \item[PT_GNU_EH_FRAME, PT_SUNW_EH_FRAME and PT_SUNW_UNWIND]
      The segment contains the stack unwind tables.
      See Section~\ref{sec_eh_frame} of this document.
      \footnote{
	The value for these program headers have been placed in the
	{\tt PT_LOOS} and {\tt PT_HIOS} (os specific range) in order to adapt
	to the existing GNU implementation.  New OS's wanting to
	agree on these program header should also add it into
	their OS specific range.
      }
\end{description}

\section{Dynamic Linking}

\subsubsection{Dynamic Section}

Dynamic section entries give information to the dynamic linker. Some
of this information is processor-specific, including the interpretation
of some entries in the dynamic structure.

\subsubsection{Global Offset Table (GOT)}
\label{got}

Position-independent code cannot, in general, contain absolute virtual
addresses. Global offset tables hold absolute addresses in private
data, thus making the addresses available without compromising the
position-independence and shareability of a program's text. A program
references its global offset table using position-independent
addressing and extracts absolute values, thus redirecting
position-independent references to absolute locations.
 
If a program requires direct access to the absolute address of a
symbol, that symbol will have a global offset table entry. Because
the executable file and shared objects have separate global offset
tables, a symbol's address may appear in several tables. The dynamic
linker processes all the global offset table relocations before giving
control to any code in the process image, thus ensuring the absolute
addresses are available during execution.

The tables first entry (number zero) is reserved to hold the address
of the dynamic
structure, referenced with the symbol \code{\_DYNAMIC}. This allows a
program, such as the dynamic linker, to find its own dynamic structure
without having yet processed its relocation entries. This is
especially important for the dynamic linker, because it must
initialize itself without relying on other programs to relocate its
memory image. On the \xARCH architecture, entries one and two in the
global offset table also are reserved. 

The \textindex{global offset table} contains 64-bit addresses.

For the large models the GOT is allowed to be up to 16EB in size.

\begin{figure}[H]
\Hrule
\caption{Global Offset Table}
\begin{center}
\fbox{\code{extern Elf64_Addr _GLOBAL_OFFSET_TABLE_ [];}}
\end{center}
\Hrule
\end{figure}

The symbol \code{\_GLOBAL\_OFFSET\_TABLE\_} may reside in the
middle of the {\tt .got} section, allowing both negative and
non-negative offsets into the array of addresses.

\subsubsection{Function Addresses}
\label{function_addresses}

References to the address of a function from an executable
file and the shared objects associated with it might not resolve to
the same value. References from within shared objects will normally be
resolved by the dynamic linker to the virtual address of the function
itself. References from within the executable file to a function
defined in a shared object will normally be resolved by the link
editor to the address of the procedure linkage table entry for that
function within the executable file.

To allow comparisons of function addresses to work as expected, if an
executable file references a function defined in a shared object, the
link editor will place the address of the procedure linkage table
entry for that function in its associated symbol table entry. This
will result in symbol table entries with section index of {\tt SHN_UNDEF}
but a type of {\tt STT_FUNC} and a non-zero {\tt st_value}.  A reference to the
address of a function from within a shared library will be satisfied
by such a definition in the executable.

Some relocations are associated with procedure linkage table
entries. These entries are used for direct function calls rather than
for references to function addresses. These relocations do not use the special
symbol value described above. Otherwise a very tight endless loop
would be created.

\subsubsection{Procedure Linkage Table}
\label{plt}


%This is ia32 enhanced to 64 bits without an explicit GOT
%  register.

% This has been copied from the i386 ABI.
\index{procedure linkage table|(}
Much as the global offset table redirects position-independent address
calculations to absolute locations, the procedure linkage table
redirects position-independent function calls to absolute locations.
The link editor cannot resolve execution transfers (such as function
calls) from one executable or shared object to another.  Consequently,
the link editor arranges to have the program transfer control to
entries in the procedure linkage table.  On the \xARCH architecture,
procedure linkage tables reside in shared text, but they use addresses
in the private global offset table.  The dynamic linker determines the
destinations' absolute addresses and modifies the global offset
table's memory image accordingly.  The dynamic linker thus can
redirect the entries without compromising the position-independence
and shareability of the program's text.  Executable files and shared
object files have separate procedure linkage tables.  Unlike
\intelabi, this ABI uses the same procedure linkage table for both
programs and shared objects (see figure~\ref{small_med_plt}).

\begin{figure}[H]
\Hrule
\caption{Procedure Linkage Table (small and medium models)}
\label{small_med_plt}
\begin{footnotesize}
\begin{verbatim}
  .PLT0: pushq    GOT+8(%rip)                 # GOT[1]
         jmp      *GOT+16(%rip)               # GOT[2]
         nopl     0x0(%rax)
  .PLT1: jmp      *name1@GOTPCREL(%rip)       # 16 bytes from .PLT0
         pushq    $index1
         jmp      .PLT0
  .PLT2: jmp      *name2@GOTPCREL(%rip)       # 16 bytes from .PLT1
         pushq    $index2
         jmp      .PLT0
  .PLT3: ...
\end{verbatim}%$
\end{footnotesize}
\Hrule
\end{figure}

Following the steps below, the dynamic linker and the program
``cooperate'' to resolve symbolic references through the procedure
linkage table and the global offset table.

\begin{enumerate}
\item When first creating the memory image of the program, the dynamic
  linker sets the second and the third entries in the global offset
  table to special values.  Steps below explain more about these
  values.
\item Each shared object file in the process image has its own
  procedure linkage table, and control transfers to a procedure
  linkage table entry only from within the same object file.
\item For illustration, assume the program calls \code{name1}, which
  transfers control to the label \code{.PLT1}.
\item The first instruction jumps to the address in the global offset
  table entry for \code{name1}.  Initially the global offset table
  holds the address of the following \code{pushq} instruction, not the
  real address of \code{name1}.
\item Now the program pushes a relocation index (\textit{index}) on
  the stack. The relocation index is a 32-bit, non-negative index into
  the relocation table addressed by the \codeindex{DT_JMPREL} dynamic
  section entry.  The designated relocation entry will have type
  \codeindexwo{R_X86_64_JUMP_SLOT}, and its offset will specify the
  global offset table entry used in the previous \code{jmp}
  instruction.  The relocation entry contains a symbol table index
  that will reference the appropriate symbol, \code{name1} in the
  example.
% Note that index is sign-extended.  Since the GOT will only have
% 2^29 entries (see chapter 4), this is no problem.
\item After pushing the relocation index, the program then jumps to
  \code{.PLT0}, the first entry in the procedure linkage table.  The
  \code{pushq} instruction places the value of the second global
  offset table entry (GOT+8) on the stack, thus giving the dynamic
  linker one word of identifying information.  The program then jumps
  to the address in the third global offset table entry (GOT+16),
  which transfers control to the dynamic linker.
\item When the dynamic linker receives control, it unwinds the stack,
  looks at the designated relocation entry, finds the symbol's value,
  stores the ``real'' address for \code{name1} in its global offset
  table entry, and transfers control to the desired destination.
\item Subsequent executions of the procedure linkage table entry will
  transfer directly to \code{name1}, without calling the dynamic
  linker a second time.  That is, the \code{jmp} instruction at
  \code{.PLT1} will transfer to \code{name1}, instead of ``falling
  through'' to the \code{pushq} instruction.
\end{enumerate}

The \code{LD_BIND_NOW} environment variable can change the dynamic
linking behavior.  If its value is non-null, the dynamic linker
evaluates procedure linkage table entries before transferring control
to the program.  That is, the dynamic linker processes relocation
entries of type \codeindex{R_X86_64_JUMP_SLOT}
during process initialization.  Otherwise, the dynamic linker
evaluates procedure linkage table entries lazily, delaying symbol
resolution and relocation until the first execution of a table entry.
\index{procedure linkage table|)}

Relocation entries of type \codeindex{R_X86_64_TLSDESC} may also be
subject to lazy relocation, using a single entry in the procedure
linkage table and in the global offset table, at locations given by
\texttt{DT_TLSDESC_PLT} and \texttt{DT_TLSDESC_GOT}, respectively, as
described in ``Thread-Local Storage Descriptors for IA32 and
AMD64/EM64T''\footnote{This document is currently available via
  \url{http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt}}.

For self-containment, \texttt{DT_TLSDESC_GOT} specifies a GOT entry in
which the dynamic loader should store the address of its internal TLS
Descriptor resolver function, whereas \texttt{DT_TLSDESC_PLT}
specifies the address of a PLT entry to be used as the TLS descriptor
resolver function for lazy resolution from within this module.  The
PLT entry must push the linkmap of the module onto the stack and
tail-call the internal TLS Descriptor resolver function.

\subsubsection{Large Models}

In the small and medium code models the size of both the PLT and the GOT
is limited by the maximum 32-bit displacement size.
Consequently, the base of the PLT and the
top of the GOT can be at most 2GB apart.

Therefore, in order to support the available addressing space of 16EB,
it is necessary to extend both the PLT and the GOT. Moreover, the PLT
needs to support the GOT being over 2GB away and the GOT can be over
2GB in size.\footnote{If it is determined that the base of the PLT is
within 2GB of the top of the GOT, it is also allowed to use the same
PLT layout for a large code model object as that of the small and
medium code models.}

The PLT is extended as shown in figure~\ref{final_large_plt}
with the assumption that the GOT address is
in \code{\%r15}\footnote{See Function Prologue.}.

\begin{figure}[H]
\Hrule
\caption{Final Large Code Model PLT} \label{final_large_plt}
\begin{footnotesize}
\begin{verbatim}
.PLT0:  pushq   8(%r15)          # GOT[1]
        jmpq    *16(%r15)        # GOT[2]
        nopl    0x0(%rax,%rax,1)
.PLT1:  movabs  $name1@GOT,%r11  # 16 bytes from .PLT0
        jmp     *(%r11,%r15)
.PLT1a: pushq   $index1          # "call" dynamic linker
        jmp     .PLT0
.PLT2:  ...                      # 21 bytes from .PLT1
.PLTx:  movabs  $namex@GOT,%r11  # 102261125th entry
        jmp     *(%r11,%r15)
.PLTxa: pushq   $indexx
        pushq   8(%r15)          # repeat .PLT0 code
        jmpq    *16(%r15)
.PLTy: ...                       # 27 bytes from .PLTx
\end{verbatim}%$
\end{footnotesize}
\end{figure} 

This way, for the first 102261125 entries, each PLT entry besides
\code{.PLT0} uses only 21 bytes. Afterwards, the PLT entry code changes
by repeating that of .PLT0, when each PLT entry is 27
bytes long.  Notice that any alignment consideration is dropped in order
to keep the PLT size down.

Each extended PLT entry is thus 5 to 11 bytes larger than the small and
medium code model PLT entries.

The functionality of entry .PLT0 remains unchanged from the small and
medium code models.

Note that the symbol index is still limited to 32 bits, which would
allow for up to 4G global and external functions.

Typically, UNIX compilers support two types of PLT, generally through
the options \code{-fpic} and \code{-fPIC}.  When building position-independent
objects using the large code model, only \code{-fPIC} is allowed. Using the
option \code{-fpic} with the large code model remains reserved for future
use.


\subsection{Program Interpreter}

The valid \textindex{program interpreter} for programs conforming to the
\xARCH ABI is listed in Table \ref{interp}, which also contains the
\textindex{program interpreter} used by Linux.

\begin{figure}
  \caption{\xARCH Program Interpreter}
  \label{interp}
  \begin{center}
    \begin{tabular}[t]{l|l|l}
      \multicolumn{1}{c}{Data Model} & \multicolumn{1}{c}{Path} &
      \multicolumn{1}{c}{Linux Path} \\
      \hline
      LP64 & \path{/lib/ld64.so.1} & \path{/lib64/ld-linux-x86-64.so.2} \\
      \hline
      ILP32 & \path{/lib/ldx32.so.1} & \path{/libx32/ld-linux-x32.so.2} \\
    \end{tabular}
  \end{center}
\end{figure}

\subsection{Initialization and Termination Functions}

% copied from ia64
The implementation is responsible for executing the initialization
functions specified by \codeindexwo{DT_INIT}, \codeindexwo{DT_INIT_ARRAY},
and \codeindex{DT_PREINIT_ARRAY} entries in the executable file and
shared object files for a process, and the termination (or
finalization) functions specified by \codeindex{DT_FINI} and
\codeindexwo{DT_FINI_ARRAY}, as specified by the \textit{System V ABI}.
The user program plays no further part in executing the initialization
and termination functions specified by these dynamic tags.

\section {Program Property}

The following processor-specific program property types
\footnote{See {\bf Linux Extensions to gABI} at
\url{https://github.com/hjl-tools/linux-abi}} are defined:

\begin{table}[H]
\Hrule
  \caption{Program Property Type}
  \begin{center}
    \begin{tabular}[t]{l|l}
      \multicolumn{1}{c}{Name} & \multicolumn{1}{c}{Value} \\
      \hline
     \texttt{GNU_PROPERTY_X86_ISA_1_USED} & \texttt{0xc0000000} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_NEEDED} & \texttt{0xc0000001} \\
     \texttt{GNU_PROPERTY_X86_FEATURE_1_AND} & \texttt{0xc0000002} \\
    \end{tabular}
  \end{center}
\Hrule
\end{table}

\begin{description}
 \item[GNU_PROPERTY_X86_ISA_1_USED] Its \code{pr_data} field contains
   a 4-byte integer.  The x86 instruction sets indicated by the
   corresponding bits are used in program.  Their support in the
   hardware is optional.  A bit in the output \code{pr_data} field is
   set if  it is set in any relocatable input \code{pr_data} fields.
 \item[GNU_PROPERTY_X86_ISA_1_NEEDED] Its \code{pr_data} field contains
   a 4-byte integer.  The x86 instruction sets indicated by the
   corresponding bits are used in program and they must be supported
   by the hardware.  A bit in the output \code{pr_data} field is
   set if  it is set in any relocatable input \code{pr_data} fields.
 \item[GNU_PROPERTY_X86_FEATURE_1_AND] Its \code{pr_data} field contains
   a 4-byte integer.  A bit in the output \code{pr_data} field is set
   only if it is set in all relocatable input \code{pr_data} fields.
\end{description}

\begin{table}[H]
\Hrule
  \caption{Bit Flags For X86 Instruction Sets}
  \begin{center}
    \begin{tabular}[t]{l|l}
      \multicolumn{1}{c}{Name} & \multicolumn{1}{c}{Value} \\
      \hline
     \texttt{GNU_PROPERTY_X86_ISA_1_486} & \texttt{1U << 0} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_586} & \texttt{1U << 1} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_686} & \texttt{1U << 2} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_SSE} & \texttt{1U << 3} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_SSE2} & \texttt{1U << 4} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_SSE3} & \texttt{1U << 5} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_SSSE3} & \texttt{1U << 6} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_SSE4_1} & \texttt{1U << 7} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_SSE4_2} & \texttt{1U << 8} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_AVX} & \texttt{1U << 9} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_AVX2} & \texttt{1U << 10} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_AVX512F} & \texttt{1U << 11} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_AVX512CD} & \texttt{1U << 12} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_AVX512ER} & \texttt{1U << 13} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_AVX512PF} & \texttt{1U << 14} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_AVX512VL} & \texttt{1U << 15} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_AVX512DQ} & \texttt{1U << 16} \\
     \texttt{GNU_PROPERTY_X86_ISA_1_AVX512BW} & \texttt{1U << 17} \\
    \end{tabular}
  \end{center}
\Hrule
\end{table}

The following bits are defined for \code{GNU_PROPERTY_X86_FEATURE_1_AND}:

\begin{table}[H]
\Hrule
  \caption{GNU_PROPERTY_X86_FEATURE_1_AND Bit Flags}
  \begin{center}
    \begin{tabular}[t]{l|l}
      \multicolumn{1}{c}{Name} & \multicolumn{1}{c}{Value} \\
      \hline
     \texttt{GNU_PROPERTY_X86_FEATURE_1_IBT} & \texttt{1U << 0} \\
     \texttt{GNU_PROPERTY_X86_FEATURE_1_SHSTK} & \texttt{1U << 1} \\
    \end{tabular}
  \end{center}
\Hrule
\end{table}

\begin{description}
 \item[GNU_PROPERTY_X86_FEATURE_1_IBT] This indicates that all executable
   sections are compatible with IBT (see Section~\ref{ibt}) when
   \code{endbr64} instruction starts each valid target where an indirect
   branch instruction can land.
 \item[GNU_PROPERTY_X86_FEATURE_1_SHSTK] This indicates that all
   executable sections are compatible with SHSTK (see Section~\ref{shstk})
   where return address popped from shadow stack always matches return
   address popped from normal stack.
\end{description}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "abi"
%%% End: