Wireshark-dev: Re: [Wireshark-dev] ctype.h calls

From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Tue, 28 Oct 2014 12:17:47 -0700
On Oct 28, 2014, at 10:56 AM, Jeff Morriss <jeff.morriss.ws@xxxxxxxxx> wrote:

> Just catching up on 3 weeks of traffic on the -commits list...
> 
> Is there any reason the remaining ctype.h calls in master shouldn't be removed [and the functions put on the prohibited list in checkAPIs.pl]?

The remaining calls in Wireshark proper (I'm leaving the build tools out, at least for now), at least based on what files are still including ctype.h, are:

	in the H.245 dissector, a call to isascii() used to decide whether to display something as text or hex;

	in the S1AP dissector, a call to isalpha(), which is in a loop that is being used to check whether something should be displayed as a text string;

	in file.c, calls to toupper() and tolower() in string matching code;

	in wsutil/strnatcmp.c, calls to several functions in the "Perform 'natural order' comparisons of strings in C" routine;

	in wsutil/strptime.c, isspace() used when matching white space in an input string.

In the first two, I *suspect* that what's really intended is "is this printable ASCII?", in which case both should use g_ascii_isprint(), although if the S1AP dissector really wants to check for *alphabetic* characters, g_ascii_isalpha() could be used.
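
To make the "is this printable ASCII?" test concrete, here is a minimal, GLib-free sketch. Both function names are hypothetical stand-ins made up for illustration; in Wireshark itself the per-character check would simply be g_ascii_isprint(). The point is that the result depends only on the byte value, never on the current locale:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for g_ascii_isprint(): true only for the
 * bytes 0x20 (space) through 0x7E ('~'), whatever the locale is. */
static bool is_printable_ascii(unsigned char c)
{
    return c >= 0x20 && c <= 0x7E;
}

/* Hypothetical helper: decide whether a buffer can be shown as text
 * (every byte is printable ASCII) or should fall back to hex. */
static bool buffer_is_printable_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (!is_printable_ascii(buf[i]))
            return false;
    }
    return true;
}
```

With isalpha() or isascii() from ctype.h, the answer for bytes at or above 0x80 can vary with the locale or the platform; a range check like this one cannot.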

In file.c, I think that code is primarily (and possibly exclusively) used for the Find function in Wireshark and, for that, if the user requested a case-insensitive search:

	when searching packet summary lines and lines from the detailed dissection, the user might want a search that's case-insensitive, *using the rules of their locale*, *and treating both the string being searched for and the strings in which the search is being done as being encoded as UTF-8* (which is what they both should be); that would be a significant change;

	when searching raw packet data, making the search automatically "do the right thing" would be extremely difficult: the raw packet data might be in arbitrary encodings, and the only way to determine the encoding of a particular set of bytes would be to see what encoding was specified when it was dissected. Currently we support a vague sort of byte-oriented encoding that I guess is ASCII, and a vague sort of 2-byte-oriented encoding that I guess you could think of as UTF-16, except that it never matches anything outside the ASCII range. Maybe we should just have both matches never match anything outside the ASCII range.
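
For the raw-data case, a byte-oriented search whose case folding is confined to ASCII might look roughly like the sketch below. This is not the code in file.c; the function names are hypothetical. It also makes one assumption the text above leaves open: bytes outside A-Z/a-z (including everything at or above 0x80) are never folded and match only themselves.

```c
#include <stddef.h>

/* Fold only A-Z to a-z; every other byte value is returned unchanged,
 * so no locale-dependent ctype.h call is involved. */
static unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/* Hypothetical helper: return the offset of the first match of
 * needle in hay, ignoring ASCII case only, or -1 if there is none. */
static long find_ascii_nocase(const unsigned char *hay, size_t hay_len,
                              const unsigned char *needle, size_t needle_len)
{
    if (needle_len == 0)
        return 0;
    if (needle_len > hay_len)
        return -1;

    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        size_t j = 0;
        while (j < needle_len &&
               ascii_fold(hay[i + j]) == ascii_fold(needle[j]))
            j++;
        if (j == needle_len)
            return (long)i;
    }
    return -1;
}
```

Because nothing outside A-Z is folded, the Latin-1 bytes 0xC4 and 0xE4 ("Ä" and "ä") do not match each other here, which is the kind of predictable behavior an ASCII-only search would promise.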

In wsutil/strnatcmp.c, the "natural order" appears, from

	http://sourcefrog.net/projects/natsort/

to sort strings such that numbers in the strings are sorted in numerical order:

	Computer string sorting algorithms generally don't order strings containing numbers in the same way that a human would do. Consider:

		rfc1.txt
		rfc2086.txt
		rfc822.txt

	It would be more friendly if the program listed the files as

		rfc1.txt
		rfc822.txt
		rfc2086.txt

	Filenames sort properly if people insert leading zeros, but they don't always do that.

The routines in there are used to sort encapsulation type names for "-T" in the help output from editcap and mergecap; those are all ASCII, so using the g_ascii_XXX() routines would work.  If we want to sort strings that might *not* be nerd tokens in a natural order, in order to show them to a user, we might want to do a "dictionary sort", which would be locale-dependent.  I'd vote for stuffing "ascii" into the names of the wsutil/strnatcmp.c routines, to make it clear that the case-insensitive "natural order" sort routine will always treat A-Z and a-z as equivalent (including treating "I" and "i" as equivalent, with neither being equivalent to "İ" or "ı") and will never treat anything else as equivalent (for example, it will not treat "Ä" and "ä" as equivalent).  If we ever need a "natural human dictionary order" sort, we can worry about that problem at that point.
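
As a rough illustration of what an explicitly ASCII-only, case-insensitive "natural order" comparison promises, here is a simplified sketch. It is not the algorithm in wsutil/strnatcmp.c (which, among other things, handles leading zeros and white space differently), and the name ascii_natcasecmp is hypothetical. Digit runs are compared by numeric value, A-Z and a-z are treated as equivalent, and every other byte compares by its raw value:

```c
#include <stddef.h>

static int ascii_is_digit(unsigned char c)
{
    return c >= '0' && c <= '9';
}

/* Fold only A-Z to a-z; no locale involved. */
static unsigned char ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/* Hypothetical name. Returns <0, 0 or >0, like strcmp(). */
static int ascii_natcasecmp(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    while (*p != '\0' && *q != '\0') {
        if (ascii_is_digit(*p) && ascii_is_digit(*q)) {
            /* Skip leading zeros, then compare the digit runs:
             * a longer run is a larger number; equal-length runs
             * are compared digit by digit. */
            while (*p == '0') p++;
            while (*q == '0') q++;
            size_t np = 0, nq = 0;
            while (ascii_is_digit(p[np])) np++;
            while (ascii_is_digit(q[nq])) nq++;
            if (np != nq)
                return np < nq ? -1 : 1;
            for (size_t i = 0; i < np; i++) {
                if (p[i] != q[i])
                    return p[i] < q[i] ? -1 : 1;
            }
            p += np;
            q += nq;
        } else {
            unsigned char cp = ascii_lower(*p);
            unsigned char cq = ascii_lower(*q);
            if (cp != cq)
                return cp < cq ? -1 : 1;
            p++;
            q++;
        }
    }
    if (*p == '\0' && *q == '\0')
        return 0;
    return *p == '\0' ? -1 : 1;
}
```

With this sketch, "rfc1.txt" sorts before "rfc822.txt", which sorts before "rfc2086.txt"; "I" and "i" compare equal; and the UTF-8 encodings of "Ä" and "ä" do not, because nothing outside A-Z is ever folded.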