I'm a compulsive log-file reader - if I get bored I like to settle back, download Flarp's web server log, and have a casual read through the activity of the last day or two. Recently I saw the return of an old acquaintance - very dodgy activity from a web crawler bot identifying itself as "Mozilla/3.0 (compatible; Indy Library)".

It is my assertion that this agent is acting in very bad faith. The evidence (a couple of pages of Apache log file with annotations) is available here. This is my reasoning for disliking this agent:

  1. It is most definitely an automated browser (commonly known as a web crawler), as shown by the methodical way in which it "surfs" all the links on Flarp at high speed without ever retrieving any graphics.
  2. It makes no attempt to follow the Robots Exclusion Standard. Visits from nice, legitimate crawlers (such as search engine crawlers like GoogleBot) start with a request for the file robots.txt, which defines the areas of the site which I don't want bots to visit, index or cache. This is such a de facto standard that the request for robots.txt is conspicuous by its absence.
  3. The last time I saw this foul entity in my logs, it trawled the forums, and every email address visible in them was very promptly sent the same spam.
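
The first two points can be checked mechanically. Here is a rough Python sketch, assuming the log is in Apache's "combined" format; the function name `suspicious_clients`, the `min_pages` threshold and the list of image suffixes are my own inventions for illustration, not anything Apache provides. It flags any client that fetched plenty of pages but never an image and never robots.txt.

```python
import re
from collections import defaultdict

# Assumed layout: Apache "combined" log format
# host ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
# Hypothetical list of what counts as "graphics"
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".ico")


def suspicious_clients(lines, min_pages=20):
    """Return sorted (ip, user_agent) pairs that requested at least
    min_pages non-image URLs, no images, and never asked for robots.txt."""
    stats = defaultdict(lambda: {"pages": 0, "images": 0, "robots": False})
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        key = (m["ip"], m["agent"])
        path = m["path"].split("?", 1)[0].lower()  # ignore query strings
        if path == "/robots.txt":
            stats[key]["robots"] = True
        elif path.endswith(IMAGE_SUFFIXES):
            stats[key]["images"] += 1
        else:
            stats[key]["pages"] += 1
    return sorted(
        key
        for key, s in stats.items()
        if s["pages"] >= min_pages and s["images"] == 0 and not s["robots"]
    )
```

This is only a heuristic: a text-only browser or a caching proxy could trip it too, so anything it flags still deserves a look by eye.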

Here are some other interesting things to note about this "thing", which I'm now personally convinced is just harvesting email addresses for spamming:

  • It's not very clever at remembering where it is. When it follows a link to a page in the Bar folder (containing stuff about my Flarp!Bar project), and then tries to follow links to other pages within that folder, it mucks up and thinks that the links are in the root of the site. For some reason it does not make this mistake on all sub-folders, but I can't see why.
  • It hits pages at the current "folder level" before traversing any links which move it to a deeper folder.
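
One possible explanation for the folder mix-up, and this is purely a guess that the logs don't confirm, is that the bot resolves relative links against a folder URL that lacks its trailing slash. The standard rules for relative URLs then treat the last path segment as a file and drop it. A small Python illustration, using the hypothetical host `example.com` and a made-up page name:

```python
from urllib.parse import urljoin

# With the trailing slash, "Bar/" is treated as a folder and kept.
correct = urljoin("http://example.com/Bar/", "recipes.html")

# Without it, "Bar" is treated as a file name and replaced,
# so the link lands in the root of the site.
broken = urljoin("http://example.com/Bar", "recipes.html")

print(correct)  # http://example.com/Bar/recipes.html
print(broken)   # http://example.com/recipes.html
```

If that guess is right, it might also explain why only some sub-folders are affected: it would depend on whether the link that led the bot into the folder ended in a slash.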

Now that I've grepped February's log too, I notice that there have been several "minor" hits from this "thing" from quite different IP addresses (all included in the evidence file). A quick check of the IP addresses that these scans came from indicates that every incident I've quoted in the evidence file came from ISPs and networks in China.
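
For anyone wanting to do the same tally on their own logs, a crude sketch in Python follows. It assumes the client address is the first space-separated field of each line (as in Apache's common and combined formats); `hits_by_ip` is just a name I've made up for the example.

```python
from collections import Counter


def hits_by_ip(lines, agent_fragment="Indy Library"):
    """Count log lines mentioning agent_fragment, grouped by client address.

    Crude on purpose: the fragment is matched anywhere in the line,
    so it would also catch it in a referer or a requested path.
    """
    counts = Counter()
    needle = agent_fragment.lower()
    for line in lines:
        if needle in line.lower():
            ip = line.split(" ", 1)[0]  # first field is the client host
            counts[ip] += 1
    return counts
```

Calling `hits_by_ip(open("access.log"))` (the file name is a placeholder) gives a `Counter`, and its `most_common()` method lists the busiest addresses first, ready to be fed into a whois lookup.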

I've now excluded the Indy Library user-agent from Flarp. This can be done either with a little bit of PHP scripting or via an Apache .htaccess file - check this page in the Server Magic section for details.
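
The idea behind the exclusion is simple enough to sketch. What follows is not the PHP or .htaccess method linked above but a minimal Python (WSGI) equivalent, under the assumption that a case-insensitive substring match on the User-Agent header is good enough; `BLOCKED_AGENT_FRAGMENTS`, `is_blocked` and `block_bad_agents` are names invented for this example.

```python
# Hypothetical blocklist: lower-case fragments of unwanted User-Agent strings
BLOCKED_AGENT_FRAGMENTS = ("indy library",)


def is_blocked(user_agent):
    """True if the User-Agent contains any blocked fragment (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(fragment in ua for fragment in BLOCKED_AGENT_FRAGMENTS)


def block_bad_agents(app):
    """Wrap a WSGI app so that blocked agents get a 403 instead of content."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

Bear in mind that this only keeps out bots that are honest (or lazy) about their User-Agent; one changed header and they are back in.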

Update: It turns out, after a little investigation, that the Indy Library user-agent actually belongs to a Delphi/C++ Builder suite of tools for doing internet stuff. So someone has written the spambot that I've noticed here in Delphi or C++ Builder and used this library. If you see little individual hits from something describing itself as Indy Library, it could be some innocent application. Developers who use the Indy Library for legitimate reasons should really change the User-Agent that their software sends when making HTTP requests, to avoid being tarred with the same brush.

Another Update (18/04/03): The particular spammer operating out of China that first alerted me to the abuse of the Indy Library has either rewritten their software or just changed the User-Agent header it sends, so that it now reports itself as "Zeus 2.6" (btw: the "real Zeus" is a web server, not a spambot). I've just been spammed by them again (on behalf of trafficmagnet.com); here's the log. When I have the time to spare I will drastically overhaul the PHP method of excluding potential spambots to include a basic CAPTCHA, just in case a legitimate visitor gets wrongly picked up as a spambot.

Oh No, More Updates (10/08/03): Ed Kohler has written at greater length about the particular spammers who triggered this piece in the first place; you can find his article here.

