null terminated strings are incorrect

introduction

In the C programming language it is common to store text as null terminated strings. A null terminated string is a sequence of bytes that represents a sequence of characters in some text encoding and ends in a null byte (0x00). For wide strings the terminator is a null wide character, which takes more than one byte. Each character is mapped to one or more bytes according to the encoding.

C defines the following string types:

byte, example ASCII
multibyte, example UTF-8
wide, example UTF-32

All of them are null terminated. Each comes with syntax for creating literals in source code, and functions that operate on them like strcpy.
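As a small sketch of the literal syntax and the matching functions, here is the same text as a byte string and as a wide string. The helper names are made up for illustration. The size of wchar_t depends on the platform, so the code only relies on sizeof(wchar_t) and not on a specific width.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

// A byte string literal and a wide string literal holding the same text.
static const char byte_string[] = "hello";
static const wchar_t wide_string[] = L"hello";

// Both arrays have 6 elements: 5 characters plus the terminator.
// The wide terminator is one wchar_t, so it takes more than one byte.
static_assert(sizeof(byte_string) == 6, "5 bytes plus 1 null byte");
static_assert(sizeof(wide_string) == 6 * sizeof(wchar_t),
              "5 wide characters plus 1 null wide character");

// Hypothetical helper: copies the byte string with strcpy and measures it.
static size_t byte_length_after_copy(void) {
    char copy[sizeof(byte_string)];
    strcpy(copy, byte_string);
    return strlen(copy);
}

// Hypothetical helper: the same steps with the wide counterparts of
// strcpy and strlen.
static size_t wide_length_after_copy(void) {
    wchar_t copy[sizeof(wide_string) / sizeof(wide_string[0])];
    wcscpy(copy, wide_string);
    return wcslen(copy);
}
```

Both helpers return 5, because both families of functions find the end of the text by searching for the terminator.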

For example, the literal "hello" is a null terminated byte string represented in memory as 0x68 0x65 0x6c 0x6c 0x6f 0x00. The first 5 bytes come from the ASCII encoding. The next and last byte is the null byte. It doesn't correspond to any character in the original text. It only indicates the end of the string.
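The byte sequence above can be checked with a short sketch. hex_dump is a hypothetical helper written for this post, not a standard function, and the expected bytes assume an ASCII compatible execution character set.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// The literal occupies 6 bytes: 5 from the text and 1 null byte.
static_assert(sizeof("hello") == 6, "5 character bytes plus 1 null byte");

// Hypothetical helper: writes the bytes of a null terminated string,
// including the terminator, into `out` as space separated hex values.
static void hex_dump(const char *text, char *out, size_t out_size) {
    if (out_size == 0) {
        return;
    }
    out[0] = '\0';
    size_t size = strlen(text) + 1; // + 1 to include the null byte
    size_t used = 0;
    for (size_t i = 0; i < size && used < out_size; i++) {
        int written = snprintf(out + used, out_size - used, "%s0x%02x",
                               i == 0 ? "" : " ",
                               (unsigned int)(unsigned char)text[i]);
        if (written < 0) {
            break; // encoding error, keep what was written so far
        }
        used += (size_t)written;
    }
}
```

With a large enough buffer, hex_dump("hello", out, sizeof out) leaves 0x68 0x65 0x6c 0x6c 0x6f 0x00 in out.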

problem

Intuitively a programming language is expected to handle any string in the active encoding. It would be surprising if hello world was an acceptable string while hello earth was not.

C violates this expectation. It does so because it gives special meaning to the null byte. Encodings like ASCII and UTF-8 already assign meaning to the null byte. It encodes the null character. What exactly the null character means is not relevant. It only matters that it is a valid character and encoded as 0x00. This meaning clashes with the meaning C gives the null byte. A string supposed to contain the null character is instead cut short at its position.

For example, hello\0world (\0 is the null character) is a valid ASCII and UTF-8 string that could be a literal in source code or read from a file. It can even be typed on the right kind of keyboard, just like pressing the enter key types the newline \n character. But the string functions in the C standard library cannot handle the null character correctly because they treat it as the end of the string like in the following examples:

strcpy only copies hello.
strcmp returns that hello\0world and hello\0earth are equal.
printf only prints hello.
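The three cases above can be reproduced with a minimal sketch. The helper names are invented for illustration. memcmp is included only to show that the bytes really differ when the size is given explicitly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Each array holds 11 text bytes plus the terminator the compiler adds.
static const char world[] = "hello\0world";
static const char earth[] = "hello\0earth";
static_assert(sizeof(world) == 12, "11 text bytes plus 1 terminator");

// Copies `world` with strcpy into a buffer filled with 'x' and returns
// how many 'x' bytes survive, meaning bytes that strcpy never wrote.
static size_t bytes_not_copied(void) {
    char copy[sizeof(world)];
    memset(copy, 'x', sizeof(copy));
    strcpy(copy, world); // writes "hello" and one null byte, then stops
    size_t untouched = 0;
    for (size_t i = 0; i < sizeof(copy); i++) {
        if (copy[i] == 'x') {
            untouched++;
        }
    }
    return untouched;
}

// strcmp stops at the first null byte, so it never reaches "world" or "earth".
static int compared_equal(void) {
    return strcmp(world, earth) == 0;
}

// memcmp takes an explicit size and does see the difference.
static int bytes_differ(void) {
    return memcmp(world, earth, sizeof(world)) != 0;
}

// printf with %s prints up to the first null byte.
// Returns the number of bytes printed.
static int printed_bytes(void) {
    return printf("%s", world);
}
```

Half of the 12 byte buffer is never written by strcpy, strcmp reports the two strings as equal, and printf emits 5 bytes.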

This is not a bug in these functions. The fault lies with the C standard, which designed the string types around null termination.

solution

An alternative way to represent strings is with a pointer and a size. C++'s std::string and std::string_view do this, as do the string types of many other languages. This representation does not have C's problem and has other technical benefits unrelated to correctness that this post does not go into.
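A minimal sketch of the pointer and size representation in C could look like this. str_view, STR_VIEW, str_view_equal and str_view_copy are hypothetical names made up for this post, not part of any standard or existing library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical type: a non-owning view of `size` bytes starting at `data`.
typedef struct {
    const char *data;
    size_t size;
} str_view;

// Builds a view from a string literal or char array, leaving out the
// terminator. It must not be used with a pointer, because sizeof would
// then measure the pointer instead of the text.
#define STR_VIEW(literal) ((str_view){(literal), sizeof(literal) - 1})

static_assert(sizeof("hello\0world") - 1 == 11, "11 text bytes");

// Compares every byte. A null byte is treated like any other byte.
static bool str_view_equal(str_view a, str_view b) {
    return a.size == b.size &&
           (a.size == 0 || memcmp(a.data, b.data, a.size) == 0);
}

// Copies at most `dst_size` bytes of `src` into `dst` and returns the
// number of bytes copied. No terminator is written.
static size_t str_view_copy(char *dst, size_t dst_size, str_view src) {
    size_t count = src.size < dst_size ? src.size : dst_size;
    if (count > 0) {
        memcpy(dst, src.data, count);
    }
    return count;
}
```

With this type, STR_VIEW("hello\0world") has size 11, compares unequal to STR_VIEW("hello\0earth"), and is copied in full.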

A historical reason that C chose the null byte representation is that it saves memory. Only one extra byte is needed. This is no longer a good reason and maybe was not one even then. The gain in efficiency is not worth the loss in correctness.

Use better string types. Do not use C's null terminated string functions. Use libraries that handle strings correctly.

I would like to link a C library that replicates the standard library's string functions with pointer and size but I do not know one.

You can store string literals containing null bytes in arrays, for example const char text[] = "hello\0world". Because text is an array, sizeof(text) gives you the size of the string where a pointer would not. The ending null byte is still added, so the size includes it.
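As a sketch of the difference between the array and a pointer to it, the following helpers (names invented for illustration) measure the same literal both ways and write all of its bytes with fwrite, which takes an explicit size.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static const char text[] = "hello\0world";

// sizeof sees the whole array: 11 text bytes plus the ending null byte.
static_assert(sizeof(text) == 12, "11 text bytes plus the ending null byte");

// Size of the text itself, without the ending null byte.
static size_t text_size(void) {
    return sizeof(text) - 1;
}

// Through a pointer the array size is lost. strlen can only search for
// the first null byte.
static size_t size_seen_through_pointer(void) {
    const char *pointer = text;
    return strlen(pointer);
}

// Writes all text bytes, including the embedded null byte, to `stream`.
// Returns the number of bytes written.
static size_t write_text(FILE *stream) {
    return fwrite(text, 1, text_size(), stream);
}
```

text_size() is 11 while size_seen_through_pointer() is 5.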

examples of problematic software

SQLite

SQLite states its string type stores UTF-8. This is incorrect because SQLite uses null terminated strings. The documentation states that the result of expressions involving strings with embedded null bytes is undefined. In practice such strings don't cause errors but are sometimes silently cut off.

documentation quote

In those routines that have a fourth argument, its value is the number of bytes in the parameter. To be clear: the value is the number of bytes in the value, not the number of characters. If the fourth parameter to sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the length of the string is the number of bytes up to the first zero terminator. If the fourth parameter to sqlite3_bind_blob() is negative, then the behavior is undefined. If a non-negative fourth parameter is provided to sqlite3_bind_text() or sqlite3_bind_text16() or sqlite3_bind_text64() then that parameter must be the byte offset where the NUL terminator would occur assuming the string were NUL terminated. If any NUL characters occur at byte offsets less than the value of the fourth parameter then the resulting string value will contain embedded NULs. The result of expressions involving strings with embedded NULs is undefined.

program demonstrating null byte behavior

// Compile with `-std=c23`.

#include <assert.h>
#include <stdio.h>

#include "sqlite3.h"

int main() {
    sqlite3* db = nullptr;
    int result = 0;
    sqlite3_stmt* statement = nullptr;

    const char* data = "first\0second";
    const int data_len = 13; // 12 text bytes plus the terminating null byte

    result = sqlite3_open_v2(
        "a",
        &db,
        SQLITE_OPEN_CREATE | SQLITE_OPEN_READWRITE | SQLITE_OPEN_MEMORY,
        nullptr
    );
    assert(result == SQLITE_OK);

    result = sqlite3_exec(
        db,
        "CREATE TABLE table_(text TEXT, blob BLOB);",
        nullptr,
        nullptr,
        nullptr
    );
    assert(result == SQLITE_OK);

    result = sqlite3_prepare_v3(
        db,
        "INSERT INTO table_(text, blob) VALUES (?1, ?2);",
        -1,
        0,
        &statement,
        nullptr
    );
    assert(result == SQLITE_OK);
    result = sqlite3_bind_text(statement, 1, data, data_len, SQLITE_STATIC);
    assert(result == SQLITE_OK);
    result = sqlite3_bind_blob(statement, 2, data, data_len, SQLITE_STATIC);
    assert(result == SQLITE_OK);
    result = sqlite3_step(statement);
    assert(result == SQLITE_DONE);
    result = sqlite3_finalize(statement);
    assert(result == SQLITE_OK);

    result = sqlite3_prepare_v3(
        db,
        "SELECT text, blob FROM table_;",
        -1,
        0,
        &statement,
        nullptr
    );
    assert(result == SQLITE_OK);
    result = sqlite3_step(statement);
    assert(result == SQLITE_ROW);
    const int text_size = sqlite3_column_bytes(statement, 0);
    const int blob_size = sqlite3_column_bytes(statement, 1);
    result = sqlite3_finalize(statement);
    assert(result == SQLITE_OK);

    result = sqlite3_prepare_v3(
        db,
        "SELECT CAST(text as BLOB), CAST(blob as TEXT) FROM table_;",
        -1,
        0,
        &statement,
        nullptr
    );
    assert(result == SQLITE_OK);
    result = sqlite3_step(statement);
    assert(result == SQLITE_ROW);
    const int text_as_blob_size = sqlite3_column_bytes(statement, 0);
    const int blob_as_text_size = sqlite3_column_bytes(statement, 1);
    result = sqlite3_finalize(statement);
    assert(result == SQLITE_OK);

    result = sqlite3_prepare_v3(
        db,
        "SELECT text FROM table_ WHERE text LIKE '%second%';",
        -1,
        0,
        &statement,
        nullptr
    );
    assert(result == SQLITE_OK);
    result = sqlite3_step(statement);
    assert(result == SQLITE_ROW || result == SQLITE_DONE);
    const bool found_string = result == SQLITE_ROW;
    result = sqlite3_finalize(statement);
    assert(result == SQLITE_OK);

    result = sqlite3_close_v2(db);
    assert(result == SQLITE_OK);

    printf("text size: %i\n", text_size);
    printf("blob size: %i\n", blob_size);
    printf("text as blob size: %i\n", text_as_blob_size);
    printf("blob as text size: %i\n", blob_as_text_size);
    printf("found string: %i\n", found_string);
}

/*
possible output:

text size: 13
blob size: 13
text as blob size: 13
blob as text size: 13
found string: 0
*/

Postgres

PostgreSQL states its string type can store UTF-8. This is incorrect because Postgres uses null terminated strings. Postgres handles this better than SQLite: its documentation (search for NUL) states the restriction, and it raises an error when a string containing a null byte would be used.