[Image: burglar_small]

My next post was going to be one about WordPress issues, but then something else came up.  That post will still go live on Wednesday.  Right now, I want to talk to you about content thieves and scrapers.

We had a run-in with some content scrapers two years ago.  That scraper took the content, but left the image links intact.  At the time, I showed how to defeat that particular variety of scraper.  This scraper, however, was trickier.

I’m not sure what the purpose of “BuzzMyFx” is beyond content hijacking.  If you “check” to see if your site is scraped by them (by going to YourSiteName.buzzmyfx.com), you might see that your site isn’t being scraped.  However, your mere act of checking will CAUSE them to start scraping your site.  Scraped sites have all content redirected through their servers.  Images, stylesheets, JavaScript files, and more all seem to pour through BuzzMyFx’s servers instead of yours.  What’s worse is that, since all links go to BuzzMyFx now, clicking on a link to another site causes that site’s content to be scraped as well.

It didn’t take long to deduce what was going on.  BuzzMyFx is a server-side scraper.  Imagine someone coming to your site under normal circumstances.  They tell their browser to load “www.MyWebSite.com”.  The browser then contacts the server hosting your site and asks for that page.  The server gives the page to the browser, which shows it to you.  Simple, right?

BuzzMyFx adds an extra layer.  If you go to MyWebSite.BuzzMyFx.com, your browser goes to BuzzMyFx’s server first.  BuzzMyFx’s server then contacts your server (as if it were a browser) for the page.  Your server gives the page to the “BuzzMyFx browser” just as it does to every other browser requesting pages.  BuzzMyFx then alters the page’s code so that all links point back to them.  They also add in their own StatCounter script and change the ad code so that the revenue goes to them instead of the site owner.  Finally, they give the changed version of the page to you.
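
To make the link-altering step concrete, here is a rough Python sketch of what such a rewrite could look like.  I have no idea what BuzzMyFx's actual code is, so treat this purely as an illustration: the function name `rewrite_links` is made up, and the host names are just the example names used above.

```python
import re

def rewrite_links(html, original_host, proxy_host):
    """Point every href/src that targets original_host at proxy_host instead.

    Illustration only: a guess at the kind of rewrite a server-side
    scraper performs, not BuzzMyFx's real code.
    """
    pattern = re.compile(
        r'((?:href|src)=["\'])https?://' + re.escape(original_host),
        re.IGNORECASE,
    )
    # Keep the attribute name and opening quote, swap in the proxy host.
    return pattern.sub(lambda m: m.group(1) + "http://" + proxy_host, html)

page = (
    '<a href="http://www.mywebsite.com/about">About</a> '
    '<img src="http://www.mywebsite.com/logo.png">'
)
rewritten = rewrite_links(page, "www.mywebsite.com", "mywebsite.buzzmyfx.com")
print(rewritten)
```

Once every link and image address has been swapped like this, a visitor never touches your server directly again, which is why the scraper's server is the only "visitor" your logs will show.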

Pretty scummy, right?  Of course, by doing this, they are committing massive copyright infringement at the very least.  At $750 – $150,000 per infringement, dozens of infringements per site scraped, and possibly hundreds of thousands of sites affected, this could land them on the hook for millions of dollars.  Then there are the problems encountered if they are using a trademarked logo/name without authorization.

So how do you stop them?

Thankfully, servers keep logs of every visit.  As you loaded this up to read this post, my server dutifully recorded information such as your IP address, where you were referred from, the current date and time, and what page you were loading up.  This happens at all websites you visit, but not all people know how to read the logs.  As a webmaster, I am well versed in reading server logs.

I loaded up their scraped version of my site while checking my server logs and there it was: 192.151.156.170.  That was the IP address doing the scraping.
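
If you'd rather not eyeball the raw log, a few lines of Python can tally requests per IP address for you.  This is only a sketch, and it assumes your server writes the common Apache "combined" log format; your host may use a different layout.  The function name `requests_by_ip` and the sample log lines are made up for illustration (only the 192.151.156.170 address comes from my real logs).

```python
import re
from collections import Counter

# Start of an Apache "combined" log line (assumed format):
# IP identity user [timestamp] "METHOD path protocol" ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*"')

def requests_by_ip(log_lines, path_prefix="/"):
    """Count requests per client IP for paths starting with path_prefix."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        ip, _timestamp, _method, path = match.groups()
        if path.startswith(path_prefix):
            counts[ip] += 1
    return counts

# Made-up sample lines; in practice you would read your real access log.
sample = [
    '192.151.156.170 - - [01/Jan/2013:10:00:00 -0500] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [01/Jan/2013:10:00:02 -0500] "GET /about/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '192.151.156.170 - - [01/Jan/2013:10:00:03 -0500] "GET /style.css HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
]
counts = requests_by_ip(sample)
for ip, hits in counts.most_common():
    print(ip, hits)
```

Load the scraped copy of your site a few times, run the tally on the freshest part of your log, and the address that shows up every time you do is your likely culprit.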

Next, I opened up my “.htaccess” file.  This is a special file on your web site that controls who can access your site and what they can and can’t see.  I added the following lines at the beginning:

# Send the scraper's IP address to the warning page (needs mod_rewrite).
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^192\.151\.156\.170$
RewriteCond %{REQUEST_URI} !^/content-thief\.html$
RewriteRule ^(.*)$ /content-thief.html [R,L]

Finally, I created a simple HTML page called “content-thief.html” with big, bold, red letters warning people that this was a scraped site and they should go to my real site.  (I didn’t link to my real site since the link would be altered, so I just spelled it out.)  You can go ahead and copy my “content-thief.html” page for your own usage.  Just be sure to change the site name to your own.

Unfortunately, BuzzMyFx has already cached some of my content, so the main page of my “BuzzMyFx-ed” site doesn’t show this warning.  Still, as their cached copies expire and their server tries to grab fresh content, it will get my warning page instead.  (I went easy on them.  My initial reaction was to redirect them to some hardcore pornography.  I didn’t want my name linked with that, though.)

The other problem is that they can change their IP address, which will let them bypass this rule.  I can add their new IP address in, but it will be a constant effort to keep up with them.  Perhaps the best remedy would be for all affected site owners to contact the people who run this “service.”  Unfortunately, they’ve hidden who they are from WHOIS, but they can’t hide two things:  1) Their domain name is registered through eNom and 2) Their site is hosted by DataShack.net (not CloudFlare.com, as I first wrote).  If we can’t get them to stop, we can always get their hosting and domain name cut off.
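
If the list of addresses does start to grow, you don't have to hand-edit the rules each time.  Here is a small Python sketch that prints the .htaccess lines for any number of IP addresses, chaining the address checks with mod_rewrite's `[OR]` flag.  The function name `build_block_rules` is made up, and the 203.0.113.9 address is a placeholder from the range reserved for documentation, not a real scraper.

```python
def escape_dots(text):
    """Escape literal dots so they are not treated as regex wildcards."""
    return text.replace(".", "\\.")

def build_block_rules(ips, warning_page="/content-thief.html"):
    """Return .htaccess lines that redirect the given IPs to warning_page."""
    if not ips:
        return []
    lines = ["RewriteEngine On"]
    for index, ip in enumerate(ips):
        # Every address condition except the last one is joined with [OR].
        flag = " [OR]" if index < len(ips) - 1 else ""
        lines.append("RewriteCond %{REMOTE_ADDR} ^" + escape_dots(ip) + "$" + flag)
    # Do not redirect requests for the warning page itself (avoids a loop).
    lines.append("RewriteCond %{REQUEST_URI} !^" + escape_dots(warning_page) + "$")
    lines.append("RewriteRule ^(.*)$ " + warning_page + " [R,L]")
    return lines

# The first address is the one from my logs; the second is a placeholder.
rules = build_block_rules(["192.151.156.170", "203.0.113.9"])
print("\n".join(rules))
```

Paste the printed lines at the top of your .htaccess file in place of the old rule.  The address checks are OR-ed together as a group, and the warning-page check still applies to all of them.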

Here’s hoping this scraper menace ends soon so we can all get back to producing great content instead of trying to protect our content from being scraped.

UPDATE:  CloudFlare.com is denying being their host.  As Heather commented below, they say they are a “reverse proxy, pass-through security service.”  I’m guessing that BuzzMyFx is using CloudFlare to hide their server’s real IP address.  However, the IP address I obtained that was seizing my content (192.151.156.170) isn’t “hidden” at all.  That IP address comes from DataShack.net.  So focus communication on them, not CloudFlare.

UPDATE #2:  If you aren’t technically inclined enough to edit an .htaccess file or FTP files to your server, but you are using WordPress, you can use the WP-Ban plugin to keep them off your site instead.  This plugin lets you list IP addresses and even leave a specific message for those IP addresses to see.

UPDATE #3: According to Lazy Budget Chef, even if you manage to contact BuzzMyFx, they will try to sell you a domain protection package to “steal the blogger’s legal right to their blog, their log in credentials, mailing list, and other personal information.”  So even if you manage to contact these scrapers, don’t sign anything they give you!  You shouldn’t need to sign some form of contract for them to cease scraping – they should just stop.  Be very wary of these people.

UPDATE #4: It looks like we’ve won this battle.  BuzzMyFx seems to be down.  They could still flee to another hosting provider (or even the same one signed up under a different account) and start their service back up.  Even if they don’t come back, I’m sure other scrapers will take BuzzMyFx’s place.  Still, you need to take each victory as it comes.  Congratulations and thanks for helping take down this scraper, everyone!

NOTE: The “burglar” image above is by tzunghaor and is available from OpenClipArt.org.