this post was submitted on 31 Aug 2024

11 points (100.0% liked)

General Programming Discussion

7706 readers

10 users here now

A general programming discussion community.

Rules:

Be civil.
Please start discussions that spark conversation

Other communities

Systems

Functional Programming

Also related

[email protected]

founded 5 years ago

MODERATORS

[email protected]

Are there better alternatives to null-terminated strings? (lemmy.ml)

submitted 2 weeks ago by [email protected] to c/[email protected]

14 comments fedilink hide all child comments

top 14 comments

sorted by: hot top controversial new old

[–] [email protected] 2 points 2 weeks ago (2 children)

Better in what sense? I put some thought into this when designing an object serialization library modelled like a binary JSON.

When it got to string-encoding, I had to decide whether to go null-terminated vs length + data? The former is very space-efficient, particularly when you have a huge number of short strings. And let's face it, that's a common enough scenario. But it's nice to have the length beforehand when you are parsing the string out of a stream.

What I did in the end was come up with a variable-length integer encoding that somewhat resembles what they do in UTF-8. It means for strings < 128 chrs, the length is a single byte. Longer than that and more bytes get used as necessary.

[–] [email protected] 1 points 2 weeks ago

a variable-length integer encoding that somewhat resembles what they do in UTF-8. It means for strings < 128 chrs, the length is a single byte. Longer than that and more bytes get used as necessary.

What you used might be similar to unsigned LEB128, which is used in DWARF, Webassembly, Android's DEX format, and protobuf. Essentially encodes 7 bits of the number in each byte, with the high bit being 1 in any byte except the last one representing the number.

Though unlike UTF-8 the number's length isn't encoded in the first byte but instead implied by the final byte. Arguably making the number's encoding similar to a terminated string.

[–] [email protected] 1 points 2 weeks ago (1 children)

What about data structures like gap buffer or piece table? Would they be ideal for something like, say, a TUI-interface application?

[–] [email protected] 1 points 2 weeks ago

Oh, so you're talking about text representation in an editor or something along those lines? That's kind of a separate problem isn't it?

At the lowest level though, I suppose you still need to consider whether to use null-terminated segments. I think I'd still be going length + data, though I wouldn't worry about packing down the length representation like with serialization formats. Your code will need to be highly cognizant of the length of strings and managing dynamic memory allocation all over the place, so it's good to have those lengths quickly accessible at all times.

[–] [email protected] 0 points 2 weeks ago* (last edited 2 weeks ago) (2 children)

How would the machine know where the string would stop, since a string could contain literally any character?

But yeah... a .text section would be an alternative.

[–] [email protected] 8 points 2 weeks ago

@DmMacniel @velox_vulnus

Fixed length strings. You can only ever have strings of a particular length, no more, no less. No need to store the string length nor terminating characters. 🤓

[–] [email protected] 3 points 2 weeks ago* (last edited 2 weeks ago) (2 children)

I am talking about modern, or slightly dated-but-easy-to-implement alternatives to C string, like for example, the pointer+length encoding method in Rust, (which is also called record method, I think?), or the Pascal string method.

[–] [email protected] 7 points 2 weeks ago (1 children)

You answered your own question. Strings with length are better than null terminated. It is a mistake in the original C language library and probably a hack because the pdp11 used asciz format.

[–] [email protected] 2 points 2 weeks ago (1 children)

Lower performance though. At each iteration through the string you need to compare the length with a counter, which if you want strings longer than 255 characters will have to be multibyte. With NTS you don't need the counter or the multibyte comparison, strings can be indefinitely long, and you only need to check if the byte you just looked at is zero, which most CPUs do for free so you just use a branch-if-[not-]zero instruction.

The terminating null also gives you a fairly obvious visual clue where the end of the string is when you're debugging with a memory dump. Can you tell where the end of this string is: "ABCDEFGH"? What about now: "ABCD\0EFGH"?

[–] [email protected] 1 points 2 weeks ago

It's lower performance in the one situation of iterating on an 8bit ASCII string for programs written 30 years ago but faster in more common uses. Multibyte doesn't matter when everything is 64 bit. A 64 bit length counter is long enough for everything but the most edgy of edge cases. You take a performance hit if you aren't aligned.

Can you tell where the end of this string is: "ABCDEFGH"? What about now: "ABCD\0EFGH"?

No because unicode and binary formats means a string can contain anything.

[–] [email protected] 3 points 2 weeks ago (2 children)

Another alternative I've seen is strings that are not null terminated but where the allocated memory actually begins at ptr[-1] and contains the length of the string. The benefit is that you still get a char array starting at ptr[0].

[–] [email protected] 3 points 2 weeks ago (1 children)

But wouldn't this be potentially unsafe? What programming language has this type of implementation, by the way?

[–] [email protected] 4 points 2 weeks ago* (last edited 2 weeks ago)

Hmm I think I saw it in a C library

Edit: Might have been this one https://github.com/msteinert/bstring

Edit: actually seems it's this one. Look at what happens to ystr_header_t https://github.com/Amaury/Ylib/blob/master/src/ystr.c

[–] [email protected] 1 points 2 weeks ago

This reminds me of when I had to roll my own dynamic memory allocator for an obscure platform. (Something I never want to do again!) I stuck metadata in the negative space just before the returned pointer like you say. In my case, it was complicated by the fact that you had to worry about the memory alignment of the returned pointer to make sure it works with SIMD and all that. Ugh. But I guess with strings (or at least 8-bit-encoded strings), alignment should not be an issue.