Bare metal printf – C standard library without OS

by oqtey

Today we’ll take a look at how we can leverage Newlib to create a compact C standard library for usage on a bare metal system. In a small example, we’ll implement a few UART primitives and pass them on to Newlib which uses them as the building blocks for a full-blown printf functionality. The target platform will be RISC-V, but the concepts should, as usual, be the same for other platforms as well.

Software abstractions and C standard library

When running printf on a typical, fully operational, end-user system (e.g., a Mac or a Linux laptop), we invoke a pretty complex machinery. The application process calls the printf function, which is more often than not dynamically linked, and after a few layers of different C functions, a system call to the operating system kernel is typically invoked. The kernel will route the output through different subsystems: different terminal and pseudo-terminal primitives will be invoked, and at some point, you will also want to visually see the printf output on your screen. That also likely invokes a pretty thick stack of abstractions in order to render the characters on your screen. We won’t even talk about how printf formats the output strings based on the provided templates.

On a bare metal system, however, most of these abstractions are not available at all and the stack is much thinner.

If we’re working on bare metal, we don’t have anything below our C functions supporting us. In the full-blown example above, the process would hand over the output to the kernel through system calls, which are implemented through software interrupts. However, now we don’t have anything to hand over to, yet we want to have something like printf working, ideally outputting to a simple I/O device like UART.

This is where Newlib jumps in. You’re probably familiar with different flavors of C standard library like GNU (glibc), musl and so on, but Newlib should definitely be on your radar if you’d like to enable C standard library on bare metal.

More accurately, the way I think about Newlib is not as a C standard library, but rather as a kit to build a custom, compact C standard library.

Newlib concept

Rather than requiring you to implement the whole C standard library from scratch, Newlib boils down the implementation to a few very basic primitives with clean interfaces that can be implemented as separate functions, and then other more complex functions like printf and malloc will call these primitives. Just for intuition, we will be implementing primitives like _write, which essentially writes a buffer of raw bytes to an output stream, and Newlib builds printf on top of that in order to write more complex outputs.
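To make the layering idea concrete, here is a minimal sketch that can be compiled on any host. It is not Newlib code: prim_write, put_string and put_unsigned are hypothetical names invented for this illustration, and the "device" is just a memory buffer. The point is only that the higher-level routines never touch the device directly; they go through the one primitive.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in output device: bytes are collected in memory. */
static char sink[64];
static size_t sink_len;

/* The single low-level primitive: push raw bytes to the output device. */
int prim_write(const char *buf, int len) {
    assert(len >= 0 && sink_len + (size_t)len <= sizeof sink);
    for (int i = 0; i < len; i++) {
        sink[sink_len++] = buf[i];
    }
    return len;
}

/* Higher-level routine: measures the string, then calls the primitive. */
int put_string(const char *s) {
    int len = 0;
    while (s[len] != '\0') {
        len++;
    }
    return prim_write(s, len);
}

/* Higher-level routine: formats a decimal number, then calls the primitive. */
int put_unsigned(unsigned value) {
    char digits[20];
    int pos = (int)sizeof digits;
    do {
        digits[--pos] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    return prim_write(&digits[pos], (int)sizeof digits - pos);
}
```

Swapping the body of prim_write for something that talks to real hardware would leave put_string and put_unsigned untouched, which is the same deal Newlib offers for printf.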

In addition to providing this simple set of primitives to implement, Newlib also gives reasonable pre-cooked implementations. In one of the configurations, you can even still target Linux as the underlying platform instead of bare metal, and the provided implementation will do system calls like glibc would do. Also, if you’re going for the absolutely minimal config, Newlib will provide all the primitives in a minimal form where they just return zeroes or raise an error (equivalent to something like raising an unimplemented exception in Python or Java).

Either way, you will implement whatever building blocks you actually care about in your application and the rest would rely on the default implementation.

Let’s switch gears here and talk about the cross-compilation toolchains. Cross-compilation happens when you compile from one platform to another. Intuitively, you can think of something like cross-compiling from an x86_64/Linux platform to ARM64/Mac.

On platforms like Linux, though, things can get a lot more nuanced, as a Linux platform doesn’t necessarily mean one flavor of the C standard library, so I’d refer to these platforms more accurately as, for example, x86_64/Linux/glibc. Seen from that perspective, building on the same x86_64/Linux setup but for a different C library is effectively still cross-compiling. A concrete example would be cross-compiling from x86_64/Linux/glibc to x86_64/Linux/musl.

Furthermore, if you want to be extremely accurate and disciplined (as you should be if you want to build software that doesn’t break!), even building from one version of glibc for another is really cross-compiling. Again, as an example, building from x86_64/Linux/glibc_v1.0 for x86_64/Linux/glibc_v1.1 is cross-compilation.

This can quickly get difficult, at least with the traditional way of building and using compilers (such as GCC, for example); however, those “ancient” ways are still stuck with us for the foreseeable future. I will soon write in more detail about this, and for now, we’ll use a shortcut described below.

Toolchain details

We want a toolchain that satisfies two requirements:

  1. it builds from our host platform to RISC-V, i.e. it generates RISC-V instructions
  2. it uses the Newlib library when C standard library functionality is invoked

If you’re on a typical Linux distribution, you likely have something like GCC (or even clang) installed. When you simply run GCC on a C file without any fancy flags, what happens is that the compiler will simply build for the same platform it runs on. More accurately, the host and the target are the same, and I believe the formal term for this is native compilation. The reason why I bring this up is to ask ourselves what happens when we include something like stdio.h and call something like printf? Where is this .h file really pulled from and where is ultimately the printf implementation found so it can be linked against?

This really depends on the way your compiler was built. When you build GCC from source, and you run ./configure, you can specify a ton of flags that will drive this behavior. As promised, I will write more about it in the future. For now, let’s keep in mind that most Linux distributions we use in daily lives follow the old UNIX philosophy when it comes to this. For example, my Debian installation has stdio.h in a standard directory at /usr/include. Furthermore, my standard C library (glibc) that can be dynamically linked is at /lib/x86_64-linux-gnu/libc.so (which really points to /lib/x86_64-linux-gnu/libc.so.6). Similarly, there is an .a file in there, but I will assume you know what .so and .a files are for. So, long story short, skipping a lot of details, your native compiler is set up to look for the C library in some of the standard spots, and when it builds for the same platform, it simply picks up the libraries from there.

Therefore, we need to:

  1. get the compiler that can generate instructions for the desired platform (machine code)
  2. set up the C standard library for that particular platform somewhere
  3. ensure that the compiler for the target platform knows how to use the library from above

From what I’ve seen, when it comes to cross-compilation, this is a fair amount of grungy work that needs to be done. The setup above takes a lot of building time and needs to be done in stages when done properly. A future article will go into details, but for now, as we mentioned before, we’ll go for a shortcut.

Remember for now that we want the includes and the .so/.a files at some path, and we want the cross-compiler to look there for the C standard library, not in the host’s include and lib directories. For example, we want the library code to be searched for at /usr/local/risc_v_stuff/lib instead of /usr/lib. When we build from something like x86_64 to RISC-V, mistakes are easy to spot, since libraries built for the wrong architecture cannot possibly work. When compiling for the same architecture and a different software platform, however, host contamination can be a real thing and can lead to very subtle and annoying problems!

Automated RISC-V toolchain build

For this exercise, let’s simply use the RISC-V toolchain. This project will still build everything from source on our host machine, but the whole annoying orchestration mentioned above, including staging the compilers, will be scripted and automated for us. With a few commands, we’ll kick off the process that will effectively set up something like /usr/local/risc_v_stuff/lib, /usr/local/risc_v_stuff/include, /usr/local/risc_v_stuff/compiler and you’ll be able to invoke /usr/local/risc_v_stuff/compiler/gcc which will know to peek into the right directories for different files and will build the right machine code. Of course, the paths will ultimately be different, but this should be good enough as a concept.

We can start by cloning the Git repository linked above. The instructions say that the --recursive flag is not necessary during cloning and things will be dynamically pulled later, but for whatever reason, this did not work on my system. I ended up running a clone with the --recursive flag to avoid issues. It took a lot of time and space though, pulling in gigabytes and gigabytes of source code.

Once the endless clone is done, you can configure the build. This is how I configured it:

./configure --prefix=/opt/riscv-newlib --enable-multilib --disable-gdb --with-cmodel=medany

I strongly encourage you to run ./configure --help to see what all the available options are and customize the build. For now, I will explain my parameters:

  1. prefix is simply where we’ll install the newly built artifacts, such as the cross-compiler, C standard library (Newlib in our case) and so on.
  2. enable-multilib will enable builds for different RISC-V setups. As a reminder, RISC-V has a ton of flavors, like RV32I, RV32IMA and so on. Please do note that enabling this flag will make your build super slow. If you don’t want to run the build with multilib, then check the help menu to figure out how to build exactly for the fine grained platform that you need.
  3. disable-gdb: for whatever reason, building GDB would always fail for me, so I just excluded it from the toolchain. Real engineers debug with printf anyway!
  4. with-cmodel: hold on to this one, I will reveal this in a ‘gotcha’ moment; for now, let’s just keep in mind I needed this in order to make the 64-bit RISC-V builds work.

Now that your build is configured, you can fire off the build process and leave it cooking for quite some time. One thing that surprised me here is that they didn’t use separate make and make install steps. Everything is done through just make, both the compilation and installation of the artifacts.

Note: I wanted to parallelize the build with -j16 as I normally do, but that also somehow broke my build, so I suggest running without this, and yes, I know it takes forever.

I simply ran

sudo make

in order to place the final artifacts in the /opt directory. This whole process is very slow, so make sure you have something else to do while this is working.

Also, you may wonder where Newlib comes in here, when I mentioned that this whole process will automate how we get Newlib available. The answer is that Newlib is simply the default option for building our toolchain here. You can check the GitHub documentation on how to set up your cross-compiler to target a RISC-V glibc or musl, but what I’ve listed above is good enough to get a cross-compiler with Newlib as the target.

I have prepared this repository to run our example. The code explanations are below (hopefully they’re not out of sync with the repo itself).

Implementing the memory and UART building blocks

Now that you have a working cross-toolchain targeting RISC-V + Newlib, most of the heavy lifting is done and we can start putting together the Newlib building blocks. Let’s begin with UART, and the first file is uart.h:

#ifndef UART_H
#define UART_H

void uart_putc(char c);
char uart_getc(void);

#endif

This is self-explanatory so far. Let’s see how these functions are implemented:

#include "uart.h"

// QEMU UART registers - these addresses are for QEMU's 16550A UART
#define UART_BASE 0x10000000
#define UART_THR  (*(volatile char *)(UART_BASE + 0x00)) // Transmit Holding Register
#define UART_RBR  (*(volatile char *)(UART_BASE + 0x00)) // Receive Buffer Register
#define UART_LSR  (*(volatile char *)(UART_BASE + 0x05)) // Line Status Register

#define UART_LSR_TX_IDLE  (1 << 5) // Transmit holding register empty (THRE)
#define UART_LSR_RX_READY (1 << 0) // Receiver data ready

void uart_putc(char c) {
    // Translate newline to CR+LF: send the carriage return first
    if (c == '\n') {
        while ((UART_LSR & UART_LSR_TX_IDLE) == 0);
        UART_THR = '\r';
    }

    // Wait until the transmit holding register is empty, then send
    while ((UART_LSR & UART_LSR_TX_IDLE) == 0);
    UART_THR = c;
}

char uart_getc(void) {
    // Wait for data
    while ((UART_LSR & UART_LSR_RX_READY) == 0);
    return UART_RBR;
}

The code above was AI-generated, but it’s accurate. And this is it as far as our UART driver is concerned. How does that now work with Newlib?

We switch to the file called syscalls.c. Here, we implement the functions that printf would rely on. We’ll also handle input, just for fun. First, we implement the primitives for writing to a file handle. The only file handles we’ll really support here are stdout and stderr. And to be perfectly accurate, there are no files here; we’re just intercepting the C standard library calls that otherwise work with these concepts.
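As a rough idea of what such a write primitive can look like, here is a sketch that can be compiled and run on a host. It is not copied from the repository. Two things are assumptions made purely for illustration: the function is called demo_write (on the target it would carry the name Newlib expects, _write), and uart_putc is replaced by a stand-in that records characters in a buffer instead of touching UART registers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Host-side stand-in for uart_putc() from uart.c: records the characters
 * instead of writing them to a hardware register. */
static char captured[64];
static size_t captured_len;

void uart_putc(char c) {
    assert(captured_len < sizeof captured - 1); /* keep room for the NUL */
    captured[captured_len++] = c;
    captured[captured_len] = '\0';
}

/* Sketch of a write primitive (hypothetical name, see the text above). */
int demo_write(int fd, const char *buf, int len) {
    if (fd != 1 && fd != 2) { /* only stdout (1) and stderr (2) are accepted */
        errno = EBADF;
        return -1;
    }
    for (int i = 0; i < len; i++) {
        uart_putc(buf[i]);
    }
    return len; /* report that every byte was consumed */
}
```

Note that the primitive receives a whole buffer and a length, so one printf call may result in a single call here that loops over many characters.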

Moving further, we provide super minimal implementations for a few more building blocks. They’re extremely basic, like the _close function, which essentially never allows any file handle to be closed.
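For a feel of how small these can be, here is a sketch of three such stubs. The demo_ names are hypothetical and were chosen so that the snippet compiles on a hosted system; on the target they would be named _close, _isatty and _lseek. The return values are one plausible choice, not necessarily what the repository uses.

```c
#include <errno.h>

/* Never allows a handle to be closed: always reports an error. */
int demo_close(int fd) {
    (void)fd;
    errno = EBADF;
    return -1;
}

/* Treats the three standard streams as a terminal (our UART). */
int demo_isatty(int fd) {
    return fd >= 0 && fd <= 2;
}

/* A UART stream has no position to move; pretend we are always at 0. */
int demo_lseek(int fd, int offset, int whence) {
    (void)fd;
    (void)offset;
    (void)whence;
    return 0;
}
```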

The one building block that is very interesting here is _sbrk. This is what gets invoked when the routines for dynamic memory allocation like malloc (needed by the printf family of functions) need to ask the OS (when there is one) to provide more raw memory to the process, which malloc can then carve up into smaller logical units. What happens here is we find the symbol _end defined by the linker, which marks the end of the static sections (we’ll see how below), and we start using the memory past that address for heap allocations, all the way until we hit the stack. Once we hit the stack, we report an error, as we have run out of memory.

#include <errno.h> // errno, ENOMEM

void* _sbrk(int incr) {
    extern char _end;         // Defined by the linker - start of heap
    extern char _stack_bottom; // Defined in our linker script - bottom of stack area

    static char *heap_end = &_end;
    char *prev_heap_end = heap_end;

    // Calculate safe stack limit - stack grows down from _stack_top towards _stack_bottom
    char *stack_limit = &_stack_bottom;

    // Check if heap would grow too close to stack
    if (heap_end + incr > stack_limit) {
        errno = ENOMEM;
        return (void*) -1; // Return error
    }

    heap_end += incr;
    return (void*) prev_heap_end;
}

Please note that the stack top and bottom here refer to the beginning and the end of the memory block allocated for the stack, not the logical top or bottom of the stack itself from the application perspective.
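If you want to poke at the boundary behaviour without a RISC-V target, the same logic can be simulated on a host. In this sketch, a hypothetical 64-byte array stands in for the region between _end and _stack_bottom, and the function is called demo_sbrk so that it does not clash with anything on a hosted system. It only models growth, which is an assumption of the sketch and not a property of sbrk in general.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical arena standing in for the memory between _end and
 * _stack_bottom. */
static char arena[64];
static char *heap_end = arena;

void *demo_sbrk(int incr) {
    char *limit = arena + sizeof arena;
    char *prev_heap_end = heap_end;

    assert(incr >= 0); /* this sketch does not model shrinking the heap */

    /* Compare against the remaining space rather than computing
     * heap_end + incr, so no out-of-range pointer is ever formed. */
    if (incr > limit - heap_end) {
        errno = ENOMEM;
        return (void *)-1;
    }

    heap_end += incr;
    return prev_heap_end;
}
```

Each successful call returns the old end of the heap, so the caller owns the incr bytes starting at the returned address. Once the arena is exhausted, every further request for more than zero bytes fails with ENOMEM.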

Application example: input and output

We’re now ready to put the actual bare metal application together. If you need a refresher on bare metal programming on RISC-V, check it out again. That article covers the key addresses and the basics of putting a bare-metal linker script together.

The app code itself is self-explanatory:

#include <stdio.h>

int main(void) {
    printf("Hello from RISC-V UART!\n");

    char buffer[100];
    printf("Type something: ");
    scanf("%99s", buffer); // width limit keeps the input within buffer[100]
    printf("You typed: %s\n", buffer);

    while (1) {}

    return 0;
}

Please note that when we’re inputting something to this app, we won’t see our key presses echoed. This is because we’re not operating inside some sort of shell environment. The implementation as it is will simply accept the key presses and store them in the internal memory structure. We’ll see what was typed when we hit the final printf.
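If you would like to see the key presses, one option is to echo them from the read primitive itself. The following is a sketch under a few assumptions, not the code from the repository: the function is named demo_read (on the target it would be _read), and both uart_getc and uart_putc are replaced by host-side stand-ins so the snippet can run anywhere. The fake input ends with a carriage return because that is what terminals often send for the Enter key.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side stand-ins for the UART driver: input comes from a fixed
 * string, echoed output is recorded in a buffer. */
static const char *fake_input = "foo\r";
static char echoed[32];
static size_t echoed_len;

char uart_getc(void) {
    assert(*fake_input != '\0'); /* the stand-in ran out of key presses */
    return *fake_input++;
}

void uart_putc(char c) {
    assert(echoed_len < sizeof echoed);
    echoed[echoed_len++] = c;
}

/* Sketch of a read primitive with echo (hypothetical name). */
int demo_read(int fd, char *buf, int len) {
    if (fd != 0) {
        return -1; /* only stdin (0) is supported */
    }
    int n = 0;
    while (n < len) {
        char c = uart_getc();
        if (c == '\r') {
            c = '\n'; /* normalize Enter to a newline */
        }
        uart_putc(c); /* echo so the user sees what was typed */
        buf[n++] = c;
        if (c == '\n') {
            break; /* hand the completed line back to the caller */
        }
    }
    return n;
}
```

Returning as soon as a newline arrives keeps the behaviour line-oriented, so a caller such as scanf gets its data once Enter is hit instead of waiting for the whole buffer to fill up.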

We now need to also put together a small C runtime. When we develop a binary for an everyday OS, we typically don’t have to think about this, and the compiler will inject the standard startup runtime which takes care of setting up the process for proper execution and passing the control on to the main function.

Our minimalistic runtime will set up the stack pointer register, zero-fill the BSS section per C standard, and then call the main code. Just for good measure, we also leave an infinite loop at the end in case main returns. Again, with a proper OS below our code, a system call would be invoked to properly close the process, and it wouldn’t just loop infinitely.

.section .text.init
.global _start

_start:
    la sp, _stack_top

    # Clear BSS section - using symbols defined in our linker script
    la t0, _bss_start
    la t1, _bss_end
clear_bss:
    bgeu t0, t1, bss_done
    sb zero, 0(t0)
    addi t0, t0, 1
    j clear_bss
bss_done:

    # Jump to C code
    call main

    # In case main returns
1:  j 1b

One of the most important parts of our application now is the linker script:

OUTPUT_FORMAT("elf64-littleriscv")
OUTPUT_ARCH("riscv")
ENTRY(_start)

MEMORY
{
  RAM (rwx) : ORIGIN = 0x80000000, LENGTH = 64M
}

SECTIONS
{
  /* Code section */
  .text : {
    *(.text.init)
    *(.text)
  } > RAM

  /* Read-only data */
  .rodata : {
    *(.rodata)
  } > RAM

  /* Initialized data */
  .data : {
    *(.data)
  } > RAM

  /* Small initialized data */
  .sdata : {
    *(.sdata)
  } > RAM

  /* BSS section with explicit symbols */
  .bss : {
    _bss_start = .;  /* Define BSS start symbol */
    *(.bss)
    *(COMMON)
    . = ALIGN(8);
    _bss_end = .;    /* Define BSS end symbol */
  } > RAM

  /* Small BSS section */
  .sbss : {
    _sbss_start = .;
    *(.sbss)
    *(.sbss.*)
    . = ALIGN(8);
    _sbss_end = .;
  } > RAM

  /* End marker for heap start */
  . = ALIGN(8);
  _end = .; /* Heap starts here and grows upwards */

  /* Stack grows downward from the end of RAM */
  _stack_size = 64K;
  _stack_top = ORIGIN(RAM) + LENGTH(RAM);
  _stack_bottom = _stack_top - _stack_size;

  /* Ensure we don't overlap with heap */
  ASSERT(_end <= _stack_bottom, "Error: Heap collides with stack")
}

Per our previous investigation of bare metal programming for the QEMU VM, we know that the user-provided code will begin executing from 0x80000000. Therefore, what we do is put the C runtime code that we previously wrote right there. In other words, our assembly code will be planted right at that memory address. Following our C runtime is the rest of the code, i.e. the text section. In this case, this is the C code we have written plus the C standard library we’re linking into our binary.

After that, we place the other sections like rodata, data, bss, and so on. The linker script will capture the symbols for BSS start and end so it can be zero-filled by the C runtime, as seen above. There’s also a small BSS section, but the C runtime code doesn’t do anything about it, to stay compact, as it’s not used by the application. It probably should be zero-filled as well.
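The zero-filling itself is nothing more than a loop over a byte range, so covering the small BSS section would only mean running the same loop over a second pair of symbols. Here is the idea expressed in C rather than assembly, as a sketch: zero_range is a hypothetical helper, and whether you call something like it from the startup code or extend the assembly loop instead is a matter of taste.

```c
#include <assert.h>

/* Zero every byte in the half-open range [start, end). */
void zero_range(char *start, char *end) {
    assert(start <= end);
    for (char *p = start; p < end; p++) {
        *p = 0;
    }
}

/* On the target, with the symbols from the linker script declared as
 *
 *     extern char _bss_start, _bss_end, _sbss_start, _sbss_end;
 *
 * both regions could be cleared before main runs:
 *
 *     zero_range(&_bss_start, &_bss_end);
 *     zero_range(&_sbss_start, &_sbss_end);
 */
```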

Then, we capture where the small BSS section ends because that also marks the end of our static sections. Following that, we let the growing heap consume everything, up until the stack. The stack occupies the last 64K of memory (and we described RAM as being 64M at the top of the linker script). A few addition and subtraction operations are done to determine where this exactly is, and we do an assert check to make sure there is no collision between the heap and the stack.
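The arithmetic is easy to double-check on a host. The sketch below redoes the linker script calculations with the values from the MEMORY block and the 64K stack size; the function names and the sample value passed to heap_capacity are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define RAM_ORIGIN 0x80000000ULL           /* ORIGIN(RAM) */
#define RAM_LENGTH (64ULL * 1024 * 1024)   /* LENGTH(RAM) = 64M */
#define STACK_SIZE (64ULL * 1024)          /* _stack_size = 64K */

/* _stack_top = ORIGIN(RAM) + LENGTH(RAM) */
uint64_t stack_top(void) {
    return RAM_ORIGIN + RAM_LENGTH;
}

/* _stack_bottom = _stack_top - _stack_size */
uint64_t stack_bottom(void) {
    return stack_top() - STACK_SIZE;
}

/* Bytes left for the heap, given where the static sections end. */
uint64_t heap_capacity(uint64_t end_symbol) {
    assert(end_symbol <= stack_bottom()); /* mirrors the linker ASSERT */
    return stack_bottom() - end_symbol;
}
```

With these numbers, the stack top comes out at 0x84000000 and the stack bottom at 0x83FF0000. The first value is also what the debug log later in the article shows being loaded into sp by the very first instruction of our runtime.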

The concept is simple: we identify the “void” between the static sections and the stack, and we let the C standard library that we’re putting together via Newlib maintain the growing heap in there. A real kernel like Linux would do its memory management magic here, handle the virtual addresses and so on, but here we really have one “process”, and a simple memory extension operation like sbrk is enough for what we want to achieve.

The ‘gotcha’ moment

Now let’s reflect back on the fact that we configured the toolchain to be built with the --with-cmodel=medany flag. What does this flag really control, and why did we need it?

If you read the top of the linker script carefully, we’re building for a 64-bit RISC-V machine. Per QEMU, our instructions will begin at 0x80000000, and we decided to simply lay the rest of the code and data after that. The compiler has to pick machine instructions that can actually reach such addresses. With GCC’s default code model (medlow), symbols are expected to sit within 2 GiB of absolute address zero, and on a 64-bit machine 0x80000000 falls just outside that window. So our application code needs the code model that can address any 2 GiB range relative to the program counter, and we build our logic with -mcmodel=medany. To be compatible, our C standard library also needs that.

If we didn’t have the aforementioned flag, the Newlib library would be built with RISC-V instructions that cannot effectively use such high addresses. Remember, the C standard library is pre-built before our application. The build system will simply pick up the machine code from the relevant library directory and link it to your application code. If the addresses do not fit the value range that the instructions support, the linker is not able to make things work.
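To build some intuition for the address ranges involved, here is a deliberately simplified model of the two code models on a 64-bit target. It treats medlow as "the address, read as a signed 64-bit number, lies between -2 GiB and +2 GiB" and medany as "the address lies within 2 GiB of the instruction that refers to it". The real reachable ranges are slightly narrower at the edges because of how the instruction immediates combine, and the function names are invented for this sketch.

```c
#include <stdint.h>

/* Simplified medlow check: is the absolute address within 2 GiB of zero? */
int fits_medlow(uint64_t addr) {
    int64_t s = (int64_t)addr;
    return s >= -(INT64_C(1) << 31) && s < (INT64_C(1) << 31);
}

/* Simplified medany check: is the address within 2 GiB of the program
 * counter of the instruction that refers to it? */
int fits_medany(uint64_t pc, uint64_t addr) {
    int64_t delta = (int64_t)(addr - pc);
    return delta >= -(INT64_C(1) << 31) && delta < (INT64_C(1) << 31);
}
```

In this model, fits_medlow(0x80000000) is false, while fits_medany(0x80000000, 0x80015000) is true, which matches the situation we ran into: everything in our image sits close to the code that uses it, but far from address zero.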

As I understand, there is a concept of linker relaxation, where the linker itself can make the code modifications, but I don’t think it would help in this case.

I don’t want to spend too much time on this, I hope the explanation above suffices, and if you would like to learn more about this problem, check out this link, where the reporter had linker errors and a solution was offered.

Running the app

I’ve included a Makefile in the GitHub repo. Check it out to see what exactly is going on there, especially how the cross-compiler is invoked (should be the first line of the file), as well as QEMU for emulation. I will highlight a few things here.

One of the CFLAGS is -specs=nosys.specs. This will drive the toolchain to use the ‘nosys’ flavor of Newlib. This is the most minimal flavor where all the building blocks are just stubs by default that return zeroes or errors.
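To picture what such a stub amounts to, here is a sketch. It is an illustration and not the actual Newlib source; stub_kill is a made-up name so that the snippet compiles on a hosted system without clashing with anything.

```c
#include <errno.h>

/* A "do nothing" stub: performs no work, flags the call as not
 * implemented through errno, and returns an error value. */
int stub_kill(int pid, int sig) {
    (void)pid;
    (void)sig;
    errno = ENOSYS; /* function not implemented */
    return -1;
}
```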

Linker flags include -nostartfiles, which tells the toolchain not to link its standard startup files, because we provide our own minimal C runtime, described above.

The rest of the Makefile should be fairly easy to follow. I strongly suggest using the debug target though. We’ll just go ahead and run:

make debug

The QEMU process starts, I punch in foo and hit enter, and after getting back my app’s output, I stop QEMU:

$ make debug
/opt/riscv-newlib/bin/riscv64-unknown-elf-gcc -march=rv64imac_zicsr -mabi=lp64 -mcmodel=medany -specs=nosys.specs -O2 -g -Wall -c main.c -o main.o
/opt/riscv-newlib/bin/riscv64-unknown-elf-gcc -march=rv64imac_zicsr -mabi=lp64 -mcmodel=medany -specs=nosys.specs -O2 -g -Wall -c uart.c -o uart.o
/opt/riscv-newlib/bin/riscv64-unknown-elf-gcc -march=rv64imac_zicsr -mabi=lp64 -mcmodel=medany -specs=nosys.specs -O2 -g -Wall -c syscalls.c -o syscalls.o
/opt/riscv-newlib/bin/riscv64-unknown-elf-gcc -march=rv64imac_zicsr -mabi=lp64 -mcmodel=medany -specs=nosys.specs -O2 -g -Wall -c startup.S -o startup.o
/opt/riscv-newlib/bin/riscv64-unknown-elf-gcc -march=rv64imac_zicsr -mabi=lp64 -mcmodel=medany -specs=nosys.specs -O2 -g -Wall -T link.ld -nostartfiles   -o firmware.elf main.o uart.o syscalls.o startup.o
/opt/riscv-newlib/lib/gcc/riscv64-unknown-elf/14.2.0/../../../../riscv64-unknown-elf/bin/ld: warning: firmware.elf has a LOAD segment with RWX permissions
qemu-system-riscv64 -machine virt -m 256 -nographic -bios firmware.elf -d in_asm,cpu_reset -D qemu_debug.log
Hello from RISC-V UART!
Type something: You typed: foo

The reason why I suggest using the debug target is that it drops a file called qemu_debug.log. That file is pretty cool as it shows you a complete trace of what your VM has been through. Naturally, you can inspect all the Newlib code if you want to figure out how exactly printf works, but I thought it’s still a pretty nice view of what the RISC-V core actually sees. Since we’re building an ELF file and passing it to QEMU, it’s even able to tell us exactly which function we’re in. It doesn’t have that for the first couple of instructions since we’re, as a reminder, executing the initial hardcoded bootloader, and then our initial C runtime, before jumping into the main function. If the first few instructions before 0x80000000 confuse you, please check out RISC-V boot process with SBI to understand what’s going on. An excerpt of my debug log is below:

----------------
IN:
Priv: 3; Virt: 0
0x0000000000001000:  00000297          auipc                   t0,0                    # 0x1000
0x0000000000001004:  02828613          addi                    a2,t0,40
0x0000000000001008:  f1402573          csrrs                   a0,mhartid,zero

----------------
IN:
Priv: 3; Virt: 0
0x000000000000100c:  0202b583          ld                      a1,32(t0)
0x0000000000001010:  0182b283          ld                      t0,24(t0)
0x0000000000001014:  00028067          jr                      t0

----------------
IN:
Priv: 3; Virt: 0
0x0000000080000000:  04000117          auipc                   sp,67108864             # 0x84000000
0x0000000080000004:  00010113          mv                      sp,sp
0x0000000080000008:  00015297          auipc                   t0,86016                # 0x80015008
0x000000008000000c:  d5828293          addi                    t0,t0,-680
0x0000000080000010:  00015317          auipc                   t1,86016                # 0x80015010
0x0000000080000014:  d5030313          addi                    t1,t1,-688
0x0000000080000018:  0062f663          bleu                    t1,t0,12                # 0x80000024

----------------
IN:
Priv: 3; Virt: 0
0x0000000080000024:  0be020ef          jal                     ra,8382                 # 0x800020e2

----------------
IN: main
Priv: 3; Virt: 0
0x00000000800020e2:  7119              addi                    sp,sp,-128
0x00000000800020e4:  00011517          auipc                   a0,69632                # 0x800130e4
0x00000000800020e8:  db450513          addi                    a0,a0,-588
0x00000000800020ec:  fc86              sd                      ra,120(sp)
0x00000000800020ee:  10e000ef          jal                     ra,270                  # 0x800021fc

----------------
IN: puts
Priv: 3; Virt: 0
0x00000000800021fc:  85aa              mv                      a1,a0
0x00000000800021fe:  00012517          auipc                   a0,73728                # 0x800141fe
0x0000000080002202:  ffa53503          ld                      a0,-6(a0)
0x0000000080002206:  b7bd              j                       -146                    # 0x80002174

Conclusion

With this example, we have ported some of the very powerful features over to our bare-metal platform, and we somewhat retained the feeling of coding on top of a proper kernel. We could keep going and enable things like “file” access, memory management and so on.

In fact, what is really interesting here is that the door is now open to use some powerful libraries in our bare-metal code, that are otherwise not necessarily expecting a bare metal environment. Some library could expect to open a file and if the only way it does it is through using the C standard library, we can essentially intercept that API call and without passing the request to the kernel, we can service it in our bare metal code.

And the concept to do this was quite simple: depend on the building blocks that Newlib defines, provide your own implementation that takes precedence over the Newlib defaults, and use the defaults for whatever you don’t care about.

Of course, in absolutely minimal environments, the size of the final software image can be a concern, as well as the number of instructions we’re injecting, but looking at the ELF file that we build in our project, it’s at 220K, which doesn’t really sound too bad. Ultimately, however, it is up to you to decide what abstractions you will use in your project. This should be one of the tools in your toolbox that can hopefully save you some time in your development.

Good luck with your hacking!

