Gabriel Weinberg's Blog: How-to stop most people from spidering your site and stealing content

  • Gene tani · 3 months ago
    (2nd attempt at comment, delete if dup)

    Just as a first round of (not terribly effective) defense, you can hash your primary key field or URL:

    http://blog.michaelgreenly.com/2008/01/obsifica...

    http://stackoverflow.com/questions/67890/whats-...

    And the usual first setup step, limits on ports 22, 80, and 443 in iptables, is still worth mentioning:

    http://kevin.vanzonneveld.net/techblog/article/...
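
The primary-key hashing idea from the comment above can be sketched roughly as follows. This is only an illustration, not the method from the linked posts: it assumes an integer primary key and a server-side secret, and the names `SECRET_KEY`, `public_id`, and `build_lookup` are hypothetical. It uses a keyed hash (HMAC-SHA256) so that sequential IDs such as 1, 2, 3 do not appear in URLs and cannot be enumerated by simply counting upward.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would live in server-side config.
SECRET_KEY = b"change-me"


def public_id(primary_key: int, length: int = 16) -> str:
    """Return an opaque, deterministic identifier for a primary key."""
    digest = hmac.new(
        SECRET_KEY, str(primary_key).encode("ascii"), hashlib.sha256
    ).hexdigest()
    # Truncate so the identifier stays short enough to use in a URL.
    return digest[:length]


def build_lookup(primary_keys):
    """Map each opaque identifier back to its primary key."""
    return {public_id(pk): pk for pk in primary_keys}
```

A page would then be served at a URL containing `public_id(pk)` instead of `pk`, with the lookup table (or an indexed database column) resolving it back. As the comment says, this is not a strong defense: anyone who can follow links will still find every page.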
  • robdimarco · 3 months ago
    Anyone doing serious scraping will just use Tor or other anonymization tactics on their IPs. If you really want to block them, a couple of suggestions:

    - Tie results to sessions rather than URL parameters, and tie the session to the IP. That makes it much harder to walk through a result set across multiple IPs. In the past this could cause issues for some users who came from places like AOL that shifted their IPs, but that is pretty rare now.
    - If you DO recognize someone scraping, I would strongly recommend sending back bogus data rather than a 404 or some other error. For an automated system, bogus data is much more of a pain in the ass to detect than an error response.
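
Both suggestions above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: sessions are kept in an in-memory dict, the client IP is passed in by the caller, and the names `ResultSessions` and `bogus_page` are hypothetical.

```python
import random
import secrets


class ResultSessions:
    """Keep result sets server-side, keyed by a session token bound to one IP."""

    def __init__(self):
        self._sessions = {}  # token -> (ip, results)

    def start(self, ip, results):
        """Store a result set for this IP and return its session token."""
        token = secrets.token_urlsafe(16)
        self._sessions[token] = (ip, list(results))
        return token

    def page(self, token, ip, offset, size=10):
        """Return one page of results, or None if the token or IP does not match."""
        entry = self._sessions.get(token)
        if entry is None:
            return None
        bound_ip, results = entry
        if ip != bound_ip:
            # Same session replayed from a different address.
            return None
        return results[offset:offset + size]


def bogus_page(real_page, seed):
    """Return plausible-looking but wrong data for a suspected scraper."""
    rng = random.Random(seed)
    fake = list(real_page)
    rng.shuffle(fake)
    # Append a random suffix so no item matches the real data.
    return [f"{item}-{rng.randint(1000, 9999)}" for item in fake]
```

The bogus page would be returned with a normal 200 response, so the scraper's error handling never fires and the bad data only shows up later.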
  • Gabriel Weinberg · 3 months ago
    I should have mentioned initially that these are static sites, like http://www.ivegotafang.com/, that involve just static HTML. I could of course still do some session stuff via JS, but what I was trying to say above is that this method has worked well in practice for the last few years. I have my logs tailed all the time, so I usually see anything egregious immediately.

    In the Apache config, recognized robots currently get sent a redirect to the Pakistani government site, which is really slow. I've considered doing that for all returns, but just haven't yet. I did that for the reasons you said.

    With regard to getting around it, certainly it's pretty easy to do so, and I mentioned Tor at the end of the post. All I can do is repeat that I watch the logs, and at least for my site this super simple system deters most. Now maybe that has to do with the nature of the content, and if I wanted to add something for, say, Duck Duck Go, I'd have to get more sophisticated.
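
The redirect for recognized robots described in the comment above lives in the Apache config, which is not shown in the thread. The same decision can be sketched in Python as an illustration only: the user-agent pattern and the `TARPIT_URL` placeholder are assumptions, not the author's actual rules or redirect target.

```python
import re

# Assumed list of user-agent fragments; the real rules are not given in the thread.
BOT_PATTERN = re.compile(r"(curl|wget|python-requests|scrapy)", re.IGNORECASE)

# Hypothetical placeholder for a deliberately slow destination.
TARPIT_URL = "http://slow.example/"


def respond(user_agent):
    """Return (status, headers): redirect recognized robots, serve everyone else."""
    if BOT_PATTERN.search(user_agent or ""):
        return 302, {"Location": TARPIT_URL}
    return 200, {}
```

As the comment concedes, a scraper that fakes a browser user agent gets past a check like this, so it only deters the unsophisticated majority.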