Wireshark-dev: Re: [Wireshark-dev] How to evaluate hex/ebcdic packet data LUA

From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Sun, 23 Oct 2016 23:48:17 -0700
On Oct 23, 2016, at 7:40 PM, Jerry White <jerrywhite518@xxxxxxxxx> wrote:

> I'm having a dickens of a time working with the packet data in my Lua dissector. I'm trying to see if a particular byte has a particular value. This byte exists in three different places in the below code, and all I want to do is test if it contains 0xc4, and I just can't get it right. Any help is appreciated.
> 
> 
> local mgi = Proto("mymgi", "Somos MGI Protocol")
> local pf_mgi_flag =  ProtoField.new("mgi_flag", "mymgi.mgi_flag", ftypes.STRING)
> 
> mgi.fields = {
> 	pf_mgi_flag
> }
> 
> local m_flag = Field.new("mymgi.mgi_flag") -- used for relational operations
> 
> function mgi.dissector(tvbuf, pktinfo, root)
> 	pktinfo.cols.protocol:set("SomosMGI")
> 	local pktlen = tvbuf:reported_length_remaining()
> 	local tree = root:add(mgi, tvbuf:range(0,pktlen))
> 
> 	local info_mgi_flag = tvbuf:range(19,1) -- used in wireshark info column

OK, so that field is a 1-character EBCDIC string?

> 	tree:add(pf_mgi_flag, tvbuf:range(19,1)) -- used in protocol tree

That won't work for EBCDIC.

All strings are kept as UTF-8 internally to Wireshark; this means that Wireshark translates them from the character encoding in the packet to UTF-8, and therefore that Wireshark must be told what the encoding for the field is.

Therefore, you should do

	tree:add_packet_field(pf_mgi_flag, tvbuf:range(19,1), ENC_EBCDIC)

to add it to the protocol tree.

To fetch the actual string, you'd need to do

	local mgi_flag = tvbuf:range(19,1)
	local info_mgi_flag = mgi_flag:string(ENC_EBCDIC)

> By the way, in the Wireshark tree it prints as \357\277\275,

That's EF BF BD in hex, or 11101111 10111111 10111101 in binary.

To put those in a form that matches a row in this description of UTF-8:

	https://en.wikipedia.org/wiki/UTF-8#Description 

that's 1110 1111 followed by 10 111111 followed by 10 111101. Dropping the 1110 and 10 marker bits leaves the payload bits 1111, 111111 and 111101, so the bit encoding of the Unicode code point in question is 1111111111111101, which is 1111 1111 1111 1101 or U+FFFD.

That's the Unicode REPLACEMENT CHARACTER:

	http://unicode.org/cldr/utility/character.jsp?a=FFFD

which is what Wireshark uses if it *can't* map something to UTF-8.
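
If you want to check that arithmetic yourself, here is a small sketch in plain Lua (no Wireshark needed). The helper name decode_utf8_3byte is made up for this example, and it only handles the 3-byte row of the UTF-8 table; it uses division and modulo rather than bit operators so it should behave the same on any Lua version.

```lua
-- Decode one 3-byte UTF-8 sequence (pattern 1110xxxx 10xxxxxx 10xxxxxx).
-- Hypothetical helper for illustration; not part of the Wireshark API.
local function decode_utf8_3byte(b1, b2, b3)
    assert(b1 >= 0xE0 and b1 <= 0xEF, "not a 3-byte lead byte")
    assert(b2 >= 0x80 and b2 <= 0xBF, "bad first continuation byte")
    assert(b3 >= 0x80 and b3 <= 0xBF, "bad second continuation byte")
    -- Keep the low 4 bits of the lead byte and the low 6 bits of each
    -- continuation byte, then concatenate them.
    return (b1 % 16) * 4096 + (b2 % 64) * 64 + (b3 % 64)
end

-- Lua string escapes are decimal, so octal \357\277\275 is \239\191\189.
local bytes = "\239\191\189"
local cp = decode_utf8_3byte(bytes:byte(1, 3))
print(string.format("U+%04X", cp))  -- U+FFFD
```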

For the Lua APIs that *don't* take a full encoding value for a string, the encoding value ends up defaulting to 0, which means ENC_ASCII. That is "a 7-bit character set in which any value with the 8th bit set is invalid and gets mapped to REPLACEMENT CHARACTER".
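
As a rough emulation of that behaviour in plain Lua (this is *not* Wireshark's actual code, and ascii_to_utf8 is a name invented for the example), the following replaces every byte with the 8th bit set by the UTF-8 encoding of U+FFFD and leaves 7-bit bytes alone:

```lua
-- UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER (EF BF BD).
local REPLACEMENT = "\239\191\189"

-- Hypothetical helper: mimic the ENC_ASCII treatment described above.
-- Bytes 0x80..0xFF are invalid in 7-bit ASCII and become U+FFFD;
-- bytes 0x00..0x7F are already valid UTF-8 and pass through unchanged.
local function ascii_to_utf8(raw)
    -- The extra parentheses drop gsub's second return value (the count).
    return (raw:gsub("[\128-\255]", REPLACEMENT))
end

-- 0xC4 (decimal 196) is the byte from the packet.
local shown = ascii_to_utf8("\196")
print(shown == REPLACEMENT)
```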
 
> but in the Info column it displays as c4.

0xC4 has the 8th bit set, and is treated as invalid in an ENC_ASCII string.

It's 'D' in EBCDIC, according to

	https://en.wikipedia.org/wiki/EBCDIC

which means that "all I want to do is test if it contains 0xc4" means "all I want to do is test if it contains the character 'D'". The actual string value, when fetched with ENC_EBCDIC, will be in UTF-8, the 7-bit subset of which is ASCII, so you would look to see if the string value contains the ASCII character 'D' (i.e., given that it's a 1-octet, and hence 1-character, string, whether it *equals* the ASCII string "D").
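
Putting that together, a minimal sketch of the test might look like this. The helper names flag_is_d and flag_is_0xc4 are made up for the example; the offset 19 and length 1 are taken from your dissector; and the code assumes it runs inside Wireshark, where ENC_EBCDIC is defined and tvbuf is the Tvb passed to the dissector. The second helper is an alternative that skips string decoding entirely and compares the raw octet.

```lua
-- Hypothetical helpers; offset 19 and length 1 come from the dissector above.

-- Compare the decoded string: with ENC_EBCDIC, Wireshark translates the
-- byte to UTF-8, so EBCDIC 0xC4 comes back as the ASCII string "D".
local function flag_is_d(tvbuf)
    return tvbuf:range(19, 1):string(ENC_EBCDIC) == "D"
end

-- Alternative: ignore character encodings and compare the raw octet.
local function flag_is_0xc4(tvbuf)
    return tvbuf:range(19, 1):uint() == 0xC4
end
```

Inside mgi.dissector you could then write something like "if flag_is_d(tvbuf) then pktinfo.cols.info:set("D") end".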