Brand Dimensions still won't stop scraping my site


Despite a robots.txt directive that disallows BDFetch, Brand Dimensions has downloaded hundreds of my original pages with photos on them. None of these pages mention any company's brand names, so I'm curious what BD is really doing. I'm guessing they could also provide some serious competitive intelligence to their clients. I just wonder what happens when they represent competing companies, like Coke and Pepsi. Here are some representative entries from my log files:

/var/log/httpd/access_log.1:72.14.164.139 - - [11/Aug/2009:07:25:27 -0400] "GET /carleton/reunionweb/WebPage-Full.00001.html HTTP/1.1" 200 1394 "-" "LinkWalker/2.0"
/var/log/httpd/access_log.1:72.14.164.140 - - [11/Aug/2009:07:25:42 -0400] "GET /carleton/reunionweb/WebPage-Full.00015.html HTTP/1.1" 200 1468 "-" "LinkWalker/2.0"
/var/log/httpd/access_log.1:72.14.164.197 - - [11/Aug/2009:07:25:57 -0400] "GET /carleton/reunionweb/WebPage-Full.00018.html HTTP/1.1" 200 1468 "-" "LinkWalker/2.0"
/var/log/httpd/access_log.1:72.14.164.157 - - [11/Aug/2009:07:26:12 -0400] "GET /carleton/reunionweb/WebPage-Thumb.00023.html HTTP/1.1" 200 3648 "-" "LinkWalker/2.0"
/var/log/httpd/access_log.1:72.14.164.179 - - [11/Aug/2009:07:26:27 -0400] "GET /carleton/reunionweb/WebPage-Full.00013.html HTTP/1.1" 200 1468 "-" "LinkWalker/2.0"
/var/log/httpd/access_log.1:72.14.164.193 - - [11/Aug/2009:07:26:42 -0400] "GET /skiing/webdest/WebPage-Full.00011.html HTTP/
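
A quick way to see the pattern in entries like these is to tally them by user agent and by source network. The sketch below is one way to do that in Python, assuming the log is in Apache's combined format with what looks like a grep-style filename prefix on each line. The names `LOG_LINE`, `summarize`, and `SAMPLE` are made up for this example, and the truncated last entry is left as-is and simply counted as unparsed.

```python
import ipaddress
import re
from collections import Counter

# Combined log format, with an optional "filename:" prefix as added by grep.
LOG_LINE = re.compile(
    r'^(?:(?P<file>[^:\s]+):)?'
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def summarize(lines):
    """Return (hits per user agent, hits per /24 network, unparsed line count)."""
    agents, networks, skipped = Counter(), Counter(), 0
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            skipped += 1  # truncated or malformed entry
            continue
        agents[match["agent"]] += 1
        # Group the address into its surrounding /24 (an arbitrary choice here).
        net = ipaddress.ip_network(match["ip"] + "/24", strict=False)
        networks[str(net)] += 1
    return agents, networks, skipped


# Three entries copied from the log above, plus the truncated one.
SAMPLE = [
    '/var/log/httpd/access_log.1:72.14.164.139 - - [11/Aug/2009:07:25:27 -0400] '
    '"GET /carleton/reunionweb/WebPage-Full.00001.html HTTP/1.1" 200 1394 "-" "LinkWalker/2.0"',
    '/var/log/httpd/access_log.1:72.14.164.140 - - [11/Aug/2009:07:25:42 -0400] '
    '"GET /carleton/reunionweb/WebPage-Full.00015.html HTTP/1.1" 200 1468 "-" "LinkWalker/2.0"',
    '/var/log/httpd/access_log.1:72.14.164.157 - - [11/Aug/2009:07:26:12 -0400] '
    '"GET /carleton/reunionweb/WebPage-Thumb.00023.html HTTP/1.1" 200 3648 "-" "LinkWalker/2.0"',
    '/var/log/httpd/access_log.1:72.14.164.193 - - [11/Aug/2009:07:26:42 -0400] '
    '"GET /skiing/webdest/WebPage-Full.00011.html HTTP/',
]

if __name__ == "__main__":
    agents, networks, skipped = summarize(SAMPLE)
    print("By user agent:", dict(agents))
    print("By /24 network:", dict(networks))
    print("Unparsed lines:", skipped)
```

Grouping by /24 is only a convenience for spotting a crawler that rotates through neighboring addresses; it says nothing about who actually owns the range.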

Brand Dimensions switched the name of their bot, which sidesteps any robots.txt directive that names the old one. Based on my own Google Analytics info, I can safely say a lot of people are interested in what Brand Dimensions is doing and how to stop it. More LinkWalker info here. Other webmasters report that the LinkWalker agent is also used by spambots harvesting email addresses for phishing attacks and the like.

Here are my latest robots.txt lines:

User-agent: BDFetch
Disallow: /

User-agent: BPImageWalker
Disallow: /

User-agent: VoilaBot
Disallow: /

User-agent: LinkWalker/2.0
Disallow: /

User-agent: LinkWalker
Disallow: /
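
To check what a well-behaved crawler would conclude from these lines, the rules can be fed to Python's standard `urllib.robotparser`. This is a minimal sketch, not a statement about how LinkWalker itself reads robots.txt; `ROBOTS_TXT` and `is_allowed` are names invented for the example, and the test path is taken from the log entries above.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt records from this post, one blank line between records.
ROBOTS_TXT = """\
User-agent: BDFetch
Disallow: /

User-agent: BPImageWalker
Disallow: /

User-agent: VoilaBot
Disallow: /

User-agent: LinkWalker/2.0
Disallow: /

User-agent: LinkWalker
Disallow: /
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if a compliant parser would let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    page = "/carleton/reunionweb/WebPage-Full.00001.html"
    for agent in ("LinkWalker/2.0", "BDFetch", "Mozilla/5.0"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, page) else "disallowed"
        print(agent, "->", verdict)
```

As far as I can tell, Python's parser compares only the part of the agent string before the slash, so in this sketch it is the plain LinkWalker record that does the matching. Either way, the result only describes what a compliant bot should do; a bot that ignores robots.txt is unaffected.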


1 Comment

I only know Unix, so here is what I use in .htaccess:

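# Block requests referred from these domains
# (RewriteEngine On is needed unless it is already enabled for this directory)
RewriteEngine On
RewriteCond %{HTTP_REFERER} cbwatch\.com [NC,OR]
RewriteCond %{HTTP_REFERER} copyscape\.com [NC]
RewriteRule ^(.*)$ - [F]

# Tag unwanted user agents, then deny anything tagged
# (Order/Deny is the Apache 2.2 access-control syntax)
BrowserMatchNoCase LinkWalker\/2\.0 bad_bot
BrowserMatchNoCase Mon_httpDownload bad_bot
BrowserMatchNoCase ZmEu bad_bot
Order Deny,Allow
Deny from env=bad_bot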
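
Before relying on user-agent patterns like these, it can help to try them against a few agent strings offline. The Python sketch below approximates BrowserMatchNoCase as an unanchored, case-insensitive regex search; that is an assumption made for illustration, not Apache's actual code path, and `BAD_BOT_PATTERNS` and `is_bad_bot` are hypothetical names.

```python
import re

# The three patterns from the .htaccess rules above.
BAD_BOT_PATTERNS = [r"LinkWalker\/2\.0", r"Mon_httpDownload", r"ZmEu"]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in BAD_BOT_PATTERNS]


def is_bad_bot(user_agent):
    """Return True if any pattern is found anywhere in the user-agent string."""
    return any(p.search(user_agent) for p in _COMPILED)


if __name__ == "__main__":
    for agent in ("LinkWalker/2.0", "linkwalker/2.0", "LinkWalker/3.0", "Mozilla/5.0"):
        print(agent, "->", "blocked" if is_bad_bot(agent) else "passed")
```

Note that a pattern pinned to a version number stops matching as soon as the version changes; in this sketch a hypothetical "LinkWalker/3.0" would pass, so a shorter pattern such as `LinkWalker` may be the safer choice.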

robots.txt only works on nice bots.


About this Entry

This page contains a single entry by Larry published on August 21, 2009 11:54 AM.
