Wireshark-dev: Re: [Wireshark-dev] Request for RFC regarding string handling
From: Ed Beroset <[email protected]>
Date: Mon, 28 Oct 2013 20:03:17 -0400
Evan Huus wrote:
Does anybody have (or feel like developing) a Grand Unified Theory of
Wireshark's future string handling? Michael and Guy and I took a stab
at something like this in the comments of [2] but it's a bit
disjointed and we never really came to a consensus that I recall. Does
anyone know if the switch to Qt has any effect on this (does it make
sense to adopt QStrings everywhere, for example?)
I'll go ahead and toss another (related) log on the pile: should we be 
thinking about allowing for internationalization?  We wouldn't 
necessarily need to actually provide the translations, but using the 
existing Qt framework to allow internationalization might be a good idea 
up front and may also help us work out some of the string handling.
The next time one of these issues pops up I would love to know already
how we *ought* to behave.
The difference between Wireshark and many other tools is that it's 
required to still "do the right thing" even with broken string 
encodings. Both the machine encoding and the partially-rendered human 
version may be required.
I don't have a Grand Unified String Theory handy, but can think of some 
requirements for it.  One is that it may need to render a number of 
different encodings, including the various Unicode encoding forms, 
ASCII, and perhaps others such as KOI8 and even EBCDIC. 
Mappings will have to be sensitive to the encoded length and 
able to do something reasonable even with malformed encoded strings.
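To make "something reasonable" a little more concrete, here is a minimal sketch for the simplest case, bytes that are supposed to be ASCII. The function name render_ascii and the \xNN escape convention are assumptions for illustration only, not anything Wireshark does today. The machine form is left untouched; only the rendered copy is altered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: render bytes that claim to be ASCII.
 * Printable bytes are copied as-is; anything else (and the backslash
 * itself, to keep the output unambiguous) becomes a \xNN escape.
 * Returns the rendered length excluding the NUL, or -1 if out is
 * too small. */
int render_ascii(const uint8_t *machine, size_t machine_len,
                 char *out, size_t out_size)
{
    size_t used = 0;

    assert(machine != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    for (size_t i = 0; i < machine_len; i++) {
        uint8_t b = machine[i];
        if (b >= 0x20 && b <= 0x7e && b != '\\') {
            if (used + 2 > out_size)    /* 1 char + NUL */
                return -1;
            out[used++] = (char)b;
        } else {
            if (used + 5 > out_size)    /* 4 chars + NUL */
                return -1;
            snprintf(out + used, 5, "\\x%02x", (unsigned)b);
            used += 4;
        }
    }
    out[used] = '\0';
    return (int)used;
}
```

Given the bytes 'G', 'E', 'T', 0xff, 0x00, the rendered form would be GET\xff\x00, so the reader still sees that something was there even though it could not be displayed as text.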
As more thought experiment than serious proposal, imagine that every 
protocol-based string (as contrasted with help screens or parts of the 
GUI) has something like the following structure:
typedef struct {
	encoding machine_form;  /* an enum of encodings */
	encoding human_form;	/* an enum of renderings */
	guint machine_len;	/* length of encoded form */
	guint human_len;	/* length of rendered form */
	guint8 **encoding_err;	/* array of pointers to encoding errors
				   within machine form, or NULL if
				   no errors */
	guint8 *machine;	/* pointer to encoded */
	guint8 *human;		/* pointer to rendered */
} string_s;

Is anything missing? For example, do we need to have something like "reason codes" corresponding to each encoding error? Is anything redundant?
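As one possible answer to the "reason codes" question, here is a rough sketch that scans a buffer as UTF-8 and records where each error is and why. Everything named here (the str_err_reason enum, the str_err struct, utf8_find_errors) is hypothetical, and it uses offsets rather than pointers and plain stdint types rather than guint8 so that it stands alone. The list of reasons is illustrative, not exhaustive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reason codes for a malformed UTF-8 sequence. */
typedef enum {
    STR_ERR_STRAY_CONTINUATION, /* 10xxxxxx byte with no lead byte */
    STR_ERR_INVALID_LEAD,       /* 0xC0, 0xC1, or 0xF5..0xFF */
    STR_ERR_TRUNCATED,          /* too few continuation bytes */
    STR_ERR_OVERLONG,           /* non-shortest form */
    STR_ERR_OUT_OF_RANGE        /* surrogate or above U+10FFFF */
} str_err_reason;

typedef struct {
    size_t offset;              /* position within the machine form */
    str_err_reason reason;
} str_err;

/* Scan s[0..len) as UTF-8. Stores up to max_errs errors in errs and
 * returns the total number found (which may exceed max_errs). */
size_t utf8_find_errors(const uint8_t *s, size_t len,
                        str_err *errs, size_t max_errs)
{
    size_t found = 0, i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        uint8_t b = s[i];
        size_t skip = 1;        /* bytes to step over after this one */
        str_err_reason why;

        if (b < 0x80) { i++; continue; }    /* plain ASCII */
        if (b < 0xC0) {
            why = STR_ERR_STRAY_CONTINUATION;
        } else if (b < 0xC2 || b > 0xF4) {
            why = STR_ERR_INVALID_LEAD;
        } else {
            size_t need = (b < 0xE0) ? 1 : (b < 0xF0) ? 2 : 3;
            size_t k = 1;
            /* count the continuation bytes that are actually present */
            while (k <= need && i + k < len && (s[i + k] & 0xC0) == 0x80)
                k++;
            skip = k;
            if (k <= need) {
                why = STR_ERR_TRUNCATED;
            } else if ((b == 0xE0 && s[i + 1] < 0xA0) ||
                       (b == 0xF0 && s[i + 1] < 0x90)) {
                why = STR_ERR_OVERLONG;
            } else if ((b == 0xED && s[i + 1] >= 0xA0) ||
                       (b == 0xF4 && s[i + 1] >= 0x90)) {
                why = STR_ERR_OUT_OF_RANGE;
            } else {
                i += skip;      /* well-formed sequence */
                continue;
            }
        }
        if (found < max_errs) {
            errs[found].offset = i;
            errs[found].reason = why;
        }
        found++;
        i += skip;
    }
    return found;
}
```

For the bytes 'A', 0x80, 'B', 0xE2, 0x82 this reports a stray continuation byte at offset 1 and a truncated sequence at offset 3. Returning a count alongside the array also answers a question the struct above leaves open, namely how many entries the error array holds.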
Also, if we make the possibly rash assumption that Unicode is the 
superset, perhaps we can regularize the addition of new renderings by 
requiring conversions to and from Unicode and routines that can create
an array of pointers (or maybe offsets) of encoding errors in the 
encoded version of the string.
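A sketch of what "regularizing" might look like, under the assumption that every encoding registers a routine with the same signature. The codec struct, the to_unicode_fn type and find_codec are all invented for illustration. Only the decoding direction is shown, and only for two single-byte encodings whose mapping to Unicode is trivial; KOI8 or EBCDIC would need lookup tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REPLACEMENT_CHAR 0xFFFDu    /* U+FFFD stands in for bad input */

/* Hypothetical common signature: decode in[0..in_len) to code points.
 * For these single-byte encodings out must hold in_len entries.
 * Stores up to max_errs error offsets; returns the number of errors. */
typedef size_t (*to_unicode_fn)(const uint8_t *in, size_t in_len,
                                uint32_t *out,
                                size_t *err_offsets, size_t max_errs);

size_t ascii_to_unicode(const uint8_t *in, size_t in_len, uint32_t *out,
                        size_t *err_offsets, size_t max_errs)
{
    size_t errs = 0;
    for (size_t i = 0; i < in_len; i++) {
        if (in[i] < 0x80) {
            out[i] = in[i];
        } else {                    /* high bit set: not ASCII */
            out[i] = REPLACEMENT_CHAR;
            if (errs < max_errs)
                err_offsets[errs] = i;
            errs++;
        }
    }
    return errs;
}

size_t latin1_to_unicode(const uint8_t *in, size_t in_len, uint32_t *out,
                         size_t *err_offsets, size_t max_errs)
{
    (void)err_offsets;              /* every byte is valid ISO-8859-1 */
    (void)max_errs;
    for (size_t i = 0; i < in_len; i++)
        out[i] = in[i];             /* maps directly to U+0000..U+00FF */
    return 0;
}

typedef struct {
    const char *name;
    to_unicode_fn to_unicode;
} codec;

const codec codecs[] = {
    { "ASCII",      ascii_to_unicode },
    { "ISO-8859-1", latin1_to_unicode },
};

/* Look up a codec by name; returns NULL if none is registered. */
const codec *find_codec(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof codecs / sizeof codecs[0]; i++)
        if (strcmp(codecs[i].name, name) == 0)
            return &codecs[i];
    return NULL;
}
```

Adding a new rendering would then mean adding one table entry and one (or, with the reverse direction, two) conversion routines, with the error reporting coming along for free.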
Perhaps a look at wide characters and locales as implemented in C++ 
could be useful, if only as inspiration or as a way of getting 
some more concrete ideas on the scope of the problem.
Ed