Wireshark-bugs: [Wireshark-bugs] [Bug 5738] Pango-WARNING **: Invalid UTF-8 string passed to pan

Date: Thu, 22 Sep 2011 16:33:08 -0700 (PDT)
https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=5738

Guy Harris <guy@xxxxxxxxxxxx> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Platform|Other                       |All
         OS/Version|Linux (other)               |All

--- Comment #3 from Guy Harris <guy@xxxxxxxxxxxx> 2011-09-22 16:33:06 PDT ---
The message body is

    ХРИСТОМ БО...

but the version of Wireshark I'm using screws up the И - but gets the ХР and СТ
right.  The display code is escaping some, but not all, octets with the 8th bit
set - it's escaping 0x98 but not 0xA5, for example.

I suspect this string is being run through format_text(), which is treating the
string as a sequence of single-octet code points, and isprint() is deciding
that 0xA5 is printable but 0x98 isn't.
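
The suspected behavior can be reproduced with a small sketch. It assumes the locale's isprint() behaves as it would for ISO 8859-1, where 0xA0 through 0xFF are printable and 0x80 through 0x9F are C1 controls. Both `latin1_isprint` and `escape_octets` are hypothetical names for illustration, not Wireshark functions, and this is not the actual format_text() code.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for isprint() in an ISO 8859-1 locale (assumption):
 * 0x20..0x7E and 0xA0..0xFF are printable, everything else is not. */
static int latin1_isprint(unsigned char c)
{
    return (c >= 0x20 && c <= 0x7E) || c >= 0xA0;
}

/* Copy src to dst one octet at a time, replacing every "unprintable"
 * octet with a \ooo octal escape. dst must hold at least 4*len + 1 chars.
 * Returns the number of chars written, not counting the trailing NUL. */
static size_t escape_octets(const unsigned char *src, size_t len, char *dst)
{
    size_t i, o = 0;

    for (i = 0; i < len; i++) {
        if (latin1_isprint(src[i]))
            dst[o++] = (char)src[i];
        else
            o += (size_t)sprintf(dst + o, "\\%03o", (unsigned)src[i]);
    }
    dst[o] = '\0';
    return o;
}
```

Fed the UTF-8 octets of ХРИ (D0 A5 D0 A0 D0 98), this passes the first five octets through and escapes only 0x98, so Х and Р survive while И is split into a stray 0xD0 followed by an escape.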

It is, I think, time to have format_text() treat its argument as UTF-8, not as
some unspecified other ASCII extension with all characters being a single
octet, and to do mapping such as:

   C0 control characters -> the Unicode characters intended to display them, or
to a \XXX escape;

   valid UTF-8 characters that are printable -> themselves;

   valid UTF-8 characters that aren't printable -> something;

   octet sequences not valid in UTF-8 -> Unicode REPLACEMENT CHARACTER
(U+FFFD).

I.e., format_text() needs to generate valid and displayable UTF-8, not valid
and displayable ISO 8859-n or any other single-byte extended-ASCII character
set.
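
A minimal sketch of that mapping, under stated assumptions: `utf8_seq_len` and `format_text_utf8_sketch` are hypothetical names, C0 controls (and DEL) become \ooo octal escapes, each invalid octet becomes one U+FFFD, and every valid non-ASCII character is treated as printable. A real implementation would need a Unicode property lookup for the "valid but not printable" case, which this sketch leaves out.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Length of the valid UTF-8 sequence starting at s (n >= 1 octets are
 * available), or 0 if the octets at s do not form a valid sequence. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char c = s[0];
    unsigned long cp;
    size_t len, i;

    if (c < 0x80) return 1;
    else if (c >= 0xC2 && c <= 0xDF) { len = 2; cp = c & 0x1F; }
    else if (c >= 0xE0 && c <= 0xEF) { len = 3; cp = c & 0x0F; }
    else if (c >= 0xF0 && c <= 0xF4) { len = 4; cp = c & 0x07; }
    else return 0;                    /* stray continuation or bad lead octet */

    if (len > n) return 0;            /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (len == 3 && cp < 0x800) return 0;         /* overlong encoding */
    if (len == 4 && cp < 0x10000) return 0;       /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* UTF-16 surrogate */
    if (cp > 0x10FFFF) return 0;                  /* beyond Unicode range */
    return len;
}

/* Write a valid, displayable UTF-8 rendering of src into dst.
 * dst must hold at least 4*len + 1 chars. Returns the number of chars
 * written, not counting the trailing NUL. */
static size_t format_text_utf8_sketch(const unsigned char *src, size_t len,
                                      char *dst)
{
    size_t i = 0, o = 0;

    while (i < len) {
        size_t n = utf8_seq_len(src + i, len - i);

        if (n == 0) {
            /* Invalid octet: emit U+FFFD (EF BF BD) and skip one octet. */
            memcpy(dst + o, "\xEF\xBF\xBD", 3);
            o += 3;
            i += 1;
        } else if (n == 1 && (src[i] < 0x20 || src[i] == 0x7F)) {
            /* C0 control or DEL: emit a \ooo octal escape. */
            o += (size_t)sprintf(dst + o, "\\%03o", (unsigned)src[i]);
            i += 1;
        } else {
            /* Valid sequence, assumed printable: copy it unchanged. */
            memcpy(dst + o, src + i, n);
            o += n;
            i += n;
        }
    }
    dst[o] = '\0';
    return o;
}
```

With this approach the two octets of И (D0 98) are copied through as a unit, while a lone 0x98 outside any sequence is replaced by U+FFFD rather than being escaped or passed through raw.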
