-----Original Message-----
From: wireshark-users-bounces@xxxxxxxxxxxxx [mailto:wireshark-users-bounces@xxxxxxxxxxxxx] On Behalf Of Jeffs
Sent: Wednesday, August 11, 2010 15:07
To: Community support list for Wireshark
Subject: Re: [Wireshark-users] filter for ONLY initial get request
This formula, however, only returns results minus the links and images
embedded in the web page:
tshark -r test.cap -T fields -e http.host | sed 's/?.*$//' |
sed -n '/www./p' | sort | uniq -c | sort -rn | head -n 100
15 www.propertyshark.com
8 www.nytimes.com
2 www.google-analytics.com
1 www.facebook.com
However, I am new to regex so I'm sure I may be missing something or
losing some links.
It is a common mistake to assume that every website has its main
address on a "www" subdomain. If you want a generic filter, you cannot
rely on it. If you want relevant results, you'll have to build a
non-restrictive regexp and manually filter out inappropriate results,
possibly adding some rules to exclude well-known advertising sites.
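As a rough sketch of that idea (not a definitive recipe), the function
below keeps every host rather than only those containing "www", and
drops the ones matching an exclusion list. The function name
count_hosts, the EXCLUDE variable, and the two domains listed in it are
illustrative assumptions; you would build your own list from what shows
up in your capture.

```shell
# Hypothetical exclusion list: one extended regex of hosts to drop.
# The entries are examples only; adjust them to your own capture.
EXCLUDE='(^|\.)google-analytics\.com$|(^|\.)doubleclick\.net$'

# Read one host name per line on stdin, drop empty lines (packets
# without an http.host field) and excluded hosts, then print the
# remaining hosts with their counts, most frequent first.
count_hosts() {
    grep -v '^$' | grep -Ev "$EXCLUDE" | sort | uniq -c | sort -rn
}

# Usage, assuming test.cap is your capture file:
#   tshark -r test.cap -T fields -e http.host | count_hosts | head -n 100
```

Anchoring each pattern with (^|\.) and $ means a rule matches the named
domain and its subdomains, but not an unrelated host that merely
contains the same string.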
A fully automatic solution would be to parse the data and check that it
is a well-formed HTML (or XML or plain-text) document. This will purge
videos and images from your results.
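Actually parsing each response body is more work than fits in a
one-liner. A cheaper approximation, sketched below, is to trust the
Content-Type header of each response instead of validating the body.
This is a swapped-in technique, not a real well-formedness check, and a
server can send a wrong Content-Type. The function name keep_documents
and the list of accepted types are assumptions, and so is the input
format: one "URI<TAB>content type" pair per line.

```shell
# Read "uri<TAB>content-type" lines on stdin and print only the URIs
# whose content type looks like an HTML, XML, or plain-text document.
# The match is a prefix match, so "text/html; charset=UTF-8" is kept.
keep_documents() {
    awk -F '\t' '
        tolower($2) ~ /^(text\/html|text\/plain|text\/xml|application\/xml|application\/xhtml\+xml)/ { print $1 }
    '
}

# Possible usage, assuming a tshark build that has the -Y display-filter
# option and the http.response_for.uri field (older builds may differ):
#   tshark -r test.cap -Y http.response -T fields \
#       -e http.response_for.uri -e http.content_type | keep_documents
```

Images and videos are dropped because their content types (image/...,
video/...) do not match any of the accepted prefixes.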