Ethereal-dev: Re: [Ethereal-dev] While we're on the subject of new frametypes...

Note: This archive is from the project's previous web site, ethereal.com. This list is no longer active.

From: Devin Heitmueller <dheitmueller@xxxxxxxxxxx>
Date: 13 Dec 2002 10:33:17 -0500
Right now the problem is that we leave all the decision making to the
individual dissector.  Some dissectors send 8859-1, some send ASCII,
some strip out the null bytes from UCS-2 and send what they think is
ASCII.  This decision making should all be in one place, and then we can
easily perform platform-independent hacks for display if necessary.

I would recommend that FT_STRING represent ASCII strings only (no
extended ASCII support).  Since there are numerous variants of extended
ASCII, it should be left to each dissector to declare which character
set it is using.  This means that we should add types like
FT_STRING_ISO8859-1 and FT_STRING_UCS2_LE.  Of course, the protocol
dissectors that do use 8859-1 (or some other character set) would have
to be changed, but I am confident that this short-term cost is well
worth it.

A change as described above would also limit the displaying of garbage
strings in ASCII string display, since we would be able to easily
recognize non-ASCII input (by looking for values > 127).
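
As a rough illustration of that check (a minimal sketch only; the
function name is hypothetical and is not an existing Ethereal routine),
the test for non-ASCII input could be as simple as:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Ethereal: returns 1 if every byte
 * in buf is 7-bit ASCII, 0 as soon as a value > 127 is seen. */
int
is_7bit_ascii(const uint8_t *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] > 127)
            return 0;   /* high bit set: not plain ASCII */
    }
    return 1;
}
```

A field declared as plain FT_STRING that fails this check could then be
flagged or escaped instead of being displayed as garbage.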

Then proto_tree_add_item() would perform conversion as necessary:

If the type is FT_STRING_ISO8859-1, it would convert to UTF-8.
If it is FT_STRING_UCS2_LE, it would convert to UTF-8.
If it is FT_STRING_UCS2_BE, it would convert to UTF-8.
If it is ASCII, it would convert to UTF-8.
See a pattern?
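
To make the pattern concrete, here is a minimal sketch of what such a
central conversion routine might look like.  Everything in it is an
assumption for illustration: the enum, the function names and the
buffer contract are hypothetical and are not the real
proto_tree_add_item() internals.  The enum uses underscores because a
hyphen (as in FT_STRING_ISO8859-1) is not valid in a C identifier.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical source encodings; the names are illustrative only. */
typedef enum {
    SRC_ASCII,
    SRC_ISO8859_1,
    SRC_UCS2_LE,
    SRC_UCS2_BE
} src_encoding_t;

/* Write one code point (U+0000..U+FFFF) as UTF-8; returns bytes written. */
static size_t
put_utf8(uint32_t cp, uint8_t *out)
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (uint8_t)(0xE0 | (cp >> 12));
    out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (cp & 0x3F));
    return 3;
}

/* Convert len bytes of src to NUL-terminated UTF-8 in out.
 * out must hold at least 3 * len + 1 bytes (worst case).
 * Returns the number of UTF-8 bytes written, excluding the NUL. */
size_t
to_utf8(src_encoding_t enc, const uint8_t *src, size_t len, uint8_t *out)
{
    size_t i, n = 0;
    uint32_t cp;

    switch (enc) {
    case SRC_ASCII:
    case SRC_ISO8859_1:
        /* ISO 8859-1 bytes map one-to-one onto U+0000..U+00FF. */
        for (i = 0; i < len; i++) {
            cp = src[i];
            if (enc == SRC_ASCII && cp > 127)
                cp = 0xFFFD;   /* not ASCII: use the replacement character */
            n += put_utf8(cp, out + n);
        }
        break;
    case SRC_UCS2_LE:
    case SRC_UCS2_BE:
        /* Two bytes per character; an odd trailing byte is ignored. */
        for (i = 0; i + 1 < len; i += 2) {
            cp = (enc == SRC_UCS2_LE)
                ? (uint32_t)src[i] | ((uint32_t)src[i + 1] << 8)
                : ((uint32_t)src[i] << 8) | (uint32_t)src[i + 1];
            if (cp >= 0xD800 && cp <= 0xDFFF)
                cp = 0xFFFD;   /* surrogates are not UCS-2 characters */
            n += put_utf8(cp, out + n);
        }
        break;
    }
    out[n] = '\0';
    return n;
}
```

The point of the sketch is that every branch ends in the same place:
whatever the dissector declares, the core hands UTF-8 to the display
code.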

If UTF-8 display were broken on some particular platform, we would only
have to change one piece of code to convert UTF-8 to ASCII or
ISO8859-1.  Of course, the cost of such a conversion would be the loss
of international characters, but if the platform is broken....
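
For example, the single fallback could look something like this (a
sketch under my own assumptions; the function name is hypothetical, and
replacing each non-ASCII character with '?' is just one possible
policy):

```c
#include <stddef.h>

/* Hypothetical fallback for a platform that cannot display UTF-8:
 * ASCII bytes are copied through, and each non-ASCII character is
 * replaced by a single '?'.
 * out must hold at least strlen(utf8) + 1 bytes.
 * Returns the length of the result, excluding the NUL. */
size_t
utf8_to_ascii_lossy(const char *utf8, char *out)
{
    size_t n = 0;
    const unsigned char *p = (const unsigned char *)utf8;

    while (*p != '\0') {
        if (*p < 0x80) {
            out[n++] = (char)*p++;
        } else {
            out[n++] = '?';
            p++;                            /* skip the lead byte */
            while ((*p & 0xC0) == 0x80)     /* skip continuation bytes */
                p++;
        }
    }
    out[n] = '\0';
    return n;
}
```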

At least this approach would provide consistent behavior between
dissectors and consolidate the bulk of the conversion work into one
place within the core.

-Devin

On Fri, 2002-12-13 at 02:14, Guy Harris wrote:
> On Fri, Dec 13, 2002 at 04:22:39PM +1100, Tim Potter wrote:
> > How about a new frametype for unicode strings?
> 
> Big-endian, or little-endian?  You could tell "proto_tree_add_item()"
> what the byte order is; "proto_tree_add_ustring()", however, would
> probably need to take a byte order argument.
> 
> There's currently a commented-out FT_UCS2_LE in "epan/ftypes/ftypes.h",
> for 2-byte little-endian Unicode.  We could perhaps implement that.
> 
> However, I think there are some things we should think about before
> doing Unicode (even if we don't come to a conclusion on all of them
> first - we might be able to temporarily punt on the display and printing
> issues by discarding or printing/displaying as an escape sequence
> non-ASCII characters, so those issues may not require immediate
> resolution):
> 
> 	1) What should we do about other extended-ASCII character sets? 
> 	   Currently, we don't do anything clever, which means that, for
> 	   example, ISO 8859/1 strings might work OK if you're running
> 	   on some UNIX flavor with the locale set to an 8859/1 locale,
> 	   but don't work in other locales.
> 
> 	   Should we make them Unicode strings, and have the dissector
> 	   translate them from the character set in question to Unicode?
> 	   Making the character set a property of the field might not
> 	   work - for example, that wouldn't work for OEM character sets
> 	   in SMB, as that'd have to be something set by an SMB
> 	   preference item at run time.  It might work for the Mac
> 	   character set in Appletalk, however.
> 
> 	2) As long as we're going down that path, should we store *all*
> 	   strings as Unicode in the protocol tree, and just keep the
> 	   existing FT_STRING types, and:
> 
> 		perhaps have the byte-order argument to
> 		"proto_tree_add_item()" specify, for FT_STRING types,
> 		the character set and, in cases where a multi-byte
> 		character type can come in either byte order, the byte
> 		order;
> 
> 		add a character set+byte order argument to
> 		"proto_tree_add_string()"?
> 
> 	   That complicates life for GTK+ 1.2[.x], as you have to figure
> 	   out what character encoding is being used for the font, and
> 	   translate into that.  However, GTK+ 2.x, and the Win32 GTK+
> 	   1.3[.x], use UTF-8, so we should be able to make that work
> 	   reasonably well.  Doing so *might* fix *some* of the problems
> 	   people are reporting on Windows.
> 
> 	   Recent versions of Qt use Unicode or UTF-8, so a KDE version
> 	   should be able to handle that, if we do one.
> 
> 	   I don't know offhand what Aqua uses, but I wouldn't be
> 	   surprised if you could get it to use Unicode or UTF-8.
> 
> 	   You can use Unicode for applications running on Windows NT
> 	   (NT 4.0, 2K, XP, .NET Server), so any native Windows GUI (or
> 	   Packetyzer) should be able to make that work.  Windows OT
> 	   (95, 98, Me) is another matter; there is the "Microsoft Layer
> 	   for Unicode on Windows 95/98/Me Systems":
> 
> 		http://msdn.microsoft.com/library/default.asp?url=/library/en-us/win9x/unilayer_4wj7.asp
> 
> 	   which might help - however, that *might* also affect non-GUI
> 	   APIs, causing them to use Unicode as well.  If so, we'd have
> 	   to deal with that somehow.
> 
> 	   Text output gets tricky.  On Windows, if you do a "print to
> 	   file" in Network Monitor 2.0, it prints out a Unicode text
> 	   file (which is a bit annoying if I wanted an ASCII text file,
> 	   although "tr"ing it on UNIX can end that annoyance by
> 	   stripping out the extra null bytes).  We could, I guess, do
> 	   that on Windows for Tethereal and printing, although we might
> 	   have to further Windowsify the printing code to make that
> 	   work right.
> 
> 	   On UNIX, if we can find some way to translate from Unicode or
> 	   UTF-8 to the locale's character set, we could do that for
> 	   Tethereal and printing.  The iconv library *might* handle
> 	   that, although that'd require the native iconv library to
> 	   handle UTF-8 or Unicode - I'm not sure all of them do; I seem
> 	   to remember some version of Solaris having some special
> 	   add-on developer's pack to add UTF-8 support, so it might not
> 	   handle it in that and earlier Solaris versions, although I
> 	   think Solaris 8 handles it natively - or force us to require
> 	   GNU iconv on platforms that lack a version of iconv that can
> 	   handle Unicode or UTF-8.
> 
> > Currently they can
> > either be displayed as a normal string in which case you get the first
> > character, or as a bunch of bytes which isn't very attractive.
> 
> Or you could de-Unicodeize them and use FT_STRING-family types, which is
> better than a poke in the eye with a sharp stick, but doesn't handle
> non-ASCII characters.  I think we do that in some places.
-- 
Devin Heitmueller
Senior Software Engineer
Netilla Networks Inc