Wireshark-dev: Re: [Wireshark-dev] Request for RFC regarding string handling

From: Evan Huus <eapache@xxxxxxxxx>
Date: Tue, 29 Oct 2013 10:46:59 -0400
On Mon, Oct 28, 2013 at 8:03 PM, Ed Beroset <beroset@xxxxxxxxxxxxxx> wrote:
> Also, if we make the possibly rash assumption that Unicode is the superset,
> perhaps we can regularize the addition of new renderings by requiring
> conversions to and from Unicode and routines that can create
> an array of pointers (or maybe offsets) of encoding errors in the encoded
> version of the string.

I think we more-or-less have to take Unicode as the superset, because
AFAIK none of the UI toolkits available will render anything else.

According to what's been gestating in my brain, the outstanding
questions (in the order they probably need to be answered) are:

1. How do we handle valid but non-printable characters in strings? We
currently have a mishmash of different C-style escapes, replacement
characters, and "nothing" (which is really "whatever our UI toolkit
does"). Should we pick one? Should we make it a user option? Should it
be dependent on the context? On the protocol? On the field? Should our
replacement character *always* be U+FFFD (the Unicode replacement
character) or should we also permit using - or . or any other
character that might be useful?

2. How do we handle "broken" strings (e.g. strings that claim to be
UTF-8 but don't follow the UTF-8 encoding rules)? We currently have a
mix of assertions, expert info, and "nothing" (again meaning "whatever
our UI toolkit does"). It would be useful to decode as much as
possible and annotate the errors, but that becomes almost worthy of a
program in its own right.

3. How do we represent strings internally to the dissection engine? We
are pretty standardized right now on null-terminated ASCII, but some
places use UTF-8, some use counted strings, etc. My vote here would be
to standardize on counted UTF-8 of some sort, since that is relatively
simple to manipulate and is capable of representing any string I can
think of (including embedded nulls, which keep popping up as a
problem).

4. Given 1-3, what APIs do we expose to dissectors?

5. Given 4, how do we get there from here?
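
To make questions 2 and 3 a bit more concrete, here is a rough sketch
in C of what "counted UTF-8 plus a list of error offsets" could look
like. Everything in it is hypothetical: counted_str, utf8_seq_len and
utf8_sanitize are names made up for illustration, not existing
Wireshark APIs, and always substituting U+FFFD is just one of the
possible answers to question 1. It also uses plain malloc and ignores
size overflow, which real code would not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string: the length is explicit, so embedded
 * NUL bytes are ordinary data rather than terminators. */
typedef struct {
    size_t   len;   /* number of bytes in data (not code points) */
    uint8_t *data;  /* UTF-8 bytes, NOT NUL-terminated */
} counted_str;

/* Returns the length (1-4) of the well-formed UTF-8 sequence starting
 * at p, or 0 if the bytes there are not well-formed UTF-8. */
static size_t utf8_seq_len(const uint8_t *p, size_t avail)
{
    uint8_t  b0 = p[0];
    uint32_t cp, min;
    size_t   n, i;

    assert(avail >= 1);
    if (b0 < 0x80) return 1;                        /* plain ASCII */
    else if ((b0 & 0xE0) == 0xC0) { n = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0;          /* stray continuation or invalid lead byte */

    if (avail < n) return 0;                        /* truncated */
    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;        /* bad continuation */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min) return 0;                         /* overlong form */
    if (cp > 0x10FFFF) return 0;                    /* beyond Unicode */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;     /* surrogate */
    return n;
}

/* Copies in[0..in_len) into a freshly allocated counted_str, replacing
 * every byte that cannot start a well-formed sequence with U+FFFD.
 * The input offset of each such byte is stored in err_offsets (at most
 * max_errs entries). Returns the total number of errors found; on
 * allocation failure out->data is NULL. Caller frees out->data. */
static size_t utf8_sanitize(const uint8_t *in, size_t in_len,
                            counted_str *out,
                            size_t *err_offsets, size_t max_errs)
{
    static const uint8_t repl[3] = { 0xEF, 0xBF, 0xBD };  /* U+FFFD */
    size_t i = 0, errs = 0;

    out->len  = 0;
    out->data = malloc(in_len * 3 + 1);  /* worst case: all bytes bad */
    if (out->data == NULL) return 0;

    while (i < in_len) {
        size_t n = utf8_seq_len(in + i, in_len - i);
        if (n == 0) {
            if (errs < max_errs) err_offsets[errs] = i;
            errs++;
            memcpy(out->data + out->len, repl, sizeof repl);
            out->len += sizeof repl;
            i++;                         /* resync on the next byte */
        } else {
            memcpy(out->data + out->len, in + i, n);
            out->len += n;
            i += n;
        }
    }
    return errs;
}
```

For example, feeding it the six bytes 'A', 0x00, 0xFF, 0xC3, 0xA9, 0xE2
should report two errors, at input offsets 2 and 5, and produce a
10-byte string in which the embedded NUL survives untouched. The
offsets are what a dissector could turn into expert info, which is
roughly the "array of offsets of encoding errors" Ed suggested.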

Cheers,
Evan