[Babase] Lots of darcs processes on papio?

Ryan Hardy rh87 at duke.edu
Wed Apr 28 12:30:48 EDT 2010


On Apr 27, 2010, at 5:41 PM, Karl O. Pinc wrote:

> I tried a robots.txt that denies spider access to the darcs
> archive but it seems that Microsoft's Bing, at least,
> also detects old robots.txt copies in the darcs archive
> that allow spidering.  Bing _should_ ignore all robots.txt
> that are elsewhere than at document root but it does
> not and appears to choose the most liberal policy it can
> find.  I don't _think_ google's got this problem, but I'm
> not sure.  Google seems to be polite about its spidering
> but Bing hits the box hard and so that's where I focused
> my investigations.

Yeah, the MSNbot was the culprit last night too.
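
(For reference, I assume the robots.txt you tried is something along these lines; the /darcs/ path is just my guess at where the repository browser is exposed, not the actual path:)

  User-agent: *
  Disallow: /darcs/

The catch, as you say, is that Bing apparently also honors stray copies of the file it finds inside the archive itself, even though only the copy at document root should count.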

> The easy answer is to turn off the darcs web interface.
> My plan was to disable the darcs web interface if it became
> a problem until we upgrade the OS and get something newer.
> 
> Suggestions?

If it's not affecting your usage of the box, I don't particularly have any desire to kill functionality for you.

Actually, I don't think the bots are even seeing the robots.txt at all, the way httpd is currently configured.  They don't make HTTPS requests, so their plain-HTTP requests are getting redirected to the wiki page via HTTPS (per the RewriteCond/Rule), unless I am misinterpreting the current config.
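
For context, I'm picturing the relevant part of the config as a blanket force-to-HTTPS rewrite, roughly like the following (this is just my mental model, not the actual contents of Xbabase.conf; the real rule apparently sends everything to the wiki page):

  RewriteEngine  On
  RewriteCond    %{HTTPS}  off
  RewriteRule    .*  https://%{HTTP_HOST}%{REQUEST_URI}  [R,L]

With something like that in place, a plain-HTTP GET for /robots.txt never returns the file; the bot just gets a redirect it never follows.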

Adding the following to /etc/httpd/conf.d/Xbabase.conf between lines 25 and 26 should do the trick, I think:

  # Allow crawlers to get robots.txt so they stop killing us. -rnh, 04/28/2010
  RewriteCond   %{REQUEST_URI} ^/robots\.txt$
  RewriteRule  .*  - [L]

That should allow non-HTTPS requests for the file to succeed, which will hopefully make the bots not quite so annoying.  I didn't want to edit the file myself, on the assumption that you have it checked in/out via some version control mechanism.  I'd be happy to do it otherwise.
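
Once that's in place and httpd has been reloaded, a quick sanity check from any machine that can reach the box (substitute the real hostname) would be something like:

  curl -I http://<hostname>/robots.txt

If the change works as intended, that should come back as a 200 over plain HTTP rather than a redirect to the wiki.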

-Ryan

