Published in Blogging on Monday, February 20th, 2006
Yep, it was that time again: time to run some random searches and see who has been lifting content off some of my other sites and my clients' sites.
With the proliferation of splogs and the ease of chewing up and spitting out RSS, it has become an almost monthly habit of mine to do a little searching for people who have been ripping off content on sites that I manage.
Here's how I find it, and what I do to deal with the issue.
Google of course works fairly well for finding duplicate content on the web, but the tool of choice for this task is Copyscape.
I like to go in and test a few random pages along with the money pages to see what I can find. Sadly I almost always find something. Happily, though, it can usually be resolved quite quickly, because the law is on your side - at least that has been my experience.
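Once a search has turned up a suspect page, a rough way to gauge how much of your copy it reuses is to compare overlapping word sequences. The sketch below is only an illustration and makes no claim about how Copyscape or Google work internally; the function names and the five-word window are assumptions made for the example.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Text shorter than one window: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def copied_fraction(original, suspect, size=5):
    """Fraction of the original's shingles that also appear in the suspect text."""
    mine = shingles(original, size)
    if not mine:
        return 0.0
    return len(mine & shingles(suspect, size)) / len(mine)
```

A score close to 1.0 means nearly all of your text shows up in the suspect page, while a score near 0.0 suggests the match was a coincidence or a short quote.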
I find that flagrant reproduction of RSS feeds is an easy one to handle. A simple note explaining that "if the content is not removed in 48 hours, we will be advising your hosts, your registrar and the major search engines of the infraction" will usually get the ball rolling quite nicely.
Aside: I once called a guy after getting his whois info, and let me say that was very effective, though I don't recommend it as it is far easier to keep your cool in writing.
People tend not to put up a fight (often replying that their tech guy was responsible - sheesh), but if they do, fire them this link to help them get informed (though they likely know that they're on the dark side of the law). This part is rarely necessary, but it's a nice touch because they will take you very seriously if they understand that you know where you are coming from.
Plain theft of copy (as in not RSS republishing) can be a bit more difficult (whose copy is it?), but as Mike Davidson explains here, people who have been caught generally back down quite quickly. (Replace the word you in that comment with ISP and it is a decent description of how a DMCA complaint against a site works.)
In the end the thieves will usually back down - quite often the fear of losing their Adsense account is enough motivation.
A lot of people think that this is a losing battle, and truth be told, it can be a difficult issue to keep tabs on.
The major issue for me is duplicate content in Google - I've had other sites ranking above mine where they are running my content - I'm not a big fan of that.
So for me it is worth keeping a lookout once in a while. This is especially easy with newer sites that get little traffic, but if you know your sites well enough, you can see dips in traffic to certain areas that you know should be higher. That can be a sign that it's time to do some research!
For those of you who want to find out more, Copyscape provides Responding to Plagiarism, a Resource Center, and if you have a lot of time on your hands, the forums.
Sitepoint's web development books have helped me out on many occasions, both for finding a quick solution to a problem and for leveling out my knowledge in weaker areas (JavaScript, I'm looking at you!). I am recommending the following titles from my bookshelf:
I started freelancing by diving in head first and getting on with it. Many years and a lot of experience later I was still able to take away some gems from this book, and there are plenty I wish I had thought of beforehand. If you are new to freelancing and have a lot of questions (or maybe don't know what questions to ask!) do yourself a favor and at least check out the sample chapters.
The author line-up for this book says it all. 7 excellent developers show you how to get your JavaScript coding up to speed with 7 chapters of great theory, code and examples. Metaprogramming with JavaScript (chapter 5 from Dan Webb) really helped me iron out some things I was missing about JavaScript. That said, each chapter really helped me to develop my JavaScript skills beyond simple Ajax calls and HTML insertion with libs like jQuery.
Like the other books listed here, this provides a great reference for the PHP developer looking to have the right answers from the right people at their fingertips. I tend to pull this off the shelf when I need to delve into new territory and usually find a workable solution to keep development moving. This only needs to happen once and you recoup the price of the book in time saved from having to develop the solution or find the right pattern for getting the job done.
Comments and Feedback
Well, theft can be fought, as you point out very nicely. But there are also legitimate copies (just remember site mirrors), so search engines must (should) be very careful about penalties, especially when they are automated.
Personally, I face the situation that I republish articles or interviews I do for online magazines or other people on my site (that's implicit if you want something written by/with me), and it's certainly not acceptable to be penalized for this since it's absolutely legitimate. Well, nothing has happened yet, and that hopefully remains constant.
Hey Jens,
Yeah, mirroring sites can be a difficult issue. Not a bad idea to keep the bots out of mirrors with the robots.txt file.
I remember that Keith had a problem with mirroring his site a while back, but this is all I can find on it now...
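For anyone wanting to try the robots.txt idea on a mirror, here is a minimal sketch that checks a rule set with Python's standard `urllib.robotparser` before deploying it. The robots.txt text, the hostname and the helper name are all assumptions made for the example, and the file should only be served from the mirror host, not the main site.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt for the mirror host only: shut out every crawler.
MIRROR_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def mirror_is_closed_to(user_agent, url, robots_txt=MIRROR_ROBOTS_TXT):
    """Return True if robots_txt forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # mirror.example.com is a hypothetical hostname.
    print(mirror_is_closed_to("Googlebot", "http://mirror.example.com/post.html"))
```

Keep in mind that robots.txt is only a request: well-behaved crawlers honor it, scrapers generally do not.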
I have a problem with my site and newsisfree.com. newsisfree.com reproduces my RSS feed. As a result, my own site is ranked way lower on Google! Seriously, search on Google and newsisfree is 3rd, my site is on the second page!
I asked newsisfree to remove my site, and they said they are an RSS aggregator like Bloglines and others. They haven't yet removed it. Since they are only displaying my RSS feed, I'm not sure what I can do against that... I'm worried that Google is penalizing my site, thinking I'm the one ripping off content!
Ah, thanks for the tips 'n links! I've been on the lookout for something like this for a while.
I guess the best solution to 'protect' your RSS feed is to publish only an excerpt and let readers go to your site for the full post...
*heads off to check if his sites have copies*
Have a read of this, Jesse. Pretty certain that they can't do that. I don't care what they are; just because you have RSS doesn't mean that people can reproduce it.
In the meantime, you could block their bot. Check RSS user agent identifiers for the UA string, and try banning them with robots.txt, or if you want to be sure, look for
NIF/1.1
in the UA string and deal with it via htaccess or PHP etc.
Very interesting article Mike. I did see Copyscape a long time ago but had forgotten about it. After checking a few pages I immediately found a copy of one of my pages somewhere. A PDF in some personal home directory, so it's probably meant as a personal backup. But a duplicate nonetheless.
Thanks.
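On the suggestion above of looking for NIF/1.1 in the UA string: the comment mentions htaccess or PHP, but the same check can be sketched in Python, assuming the feed is served by a WSGI application. Only the `NIF/1.1` token comes from the thread; the function names and the 403 response are assumptions made for the example.

```python
# Tokens to refuse; "NIF/1.1" is the one quoted in the comment above.
BLOCKED_UA_TOKENS = ("NIF/1.1",)


def is_blocked_agent(user_agent):
    """Return True if the User-Agent header contains a blocked token."""
    return any(token in (user_agent or "") for token in BLOCKED_UA_TOKENS)


def block_scrapers(app):
    """Wrap a WSGI app so that blocked user agents get a 403 response."""
    def middleware(environ, start_response):
        if is_blocked_agent(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else is passed through to the real application.
        return app(environ, start_response)
    return middleware
```

Wrapping the feed application with `block_scrapers(feed_app)` leaves ordinary readers untouched. A bot that changes its UA string will slip through, so treat this as a speed bump and not a lock.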
Thanks for the tip, Mike. Unfortunately, I'm using Feedburner, so I can't easily block them. Makes me seriously consider keeping my feed on my own server, though. Until then, I'll just keep writing nasty emails...
Jesse, are you using the direct feedburner link or redirecting your feed transparently?
If it is #2, you can block it... (seems that you aren't, though).
Sigh, I'm using the direct link. I think I'll change this around, though. It makes more sense to have a permanent URL on my server, especially if I can still gain the benefits of Feedburner.
So long, newsisfree!
It would be funny if someone lifted this post and credited it as their own.
Well, funny in an ironic way.
Ha, that would be funny..
*mike makes a note to check this post in a week or so*
You could do all that, or you could just produce some more content. You see, creativity existed long before copyright, and the way that 'artistic' types managed to survive then was simple. They kept working.
The model these days seems to be:
1)Produce
2)Copyright
3)Collect until you die
The model then was:
1)Produce
2)Be copied
3)Produce More, and be even more liked because everyone's heard of you from the copycats.
This is the fundamental flaw with copyright, and the current system of enforcing it. It breeds stagnation instead of advancement of the arts and sciences as it was intended to.
I digress though. It was a well thought out and useful article, I just hate the implications behind it.
My theory: Let people 'steal' your content, just make more. The copycats obviously can't, so in the end you will win.
Interesting point of view, Steve, and to be honest I do like that.
It is much easier to write new stuff and blog and be creative, but then there is the odd time you come across your content wrapped with Adsense on someone else's site, and if you've happened to have had a bad day...
Newsisfree got back to me. They updated http://newsisfree.com/sources/info/28520/ so that users not logged in (Google included) only get the first sentence of each blog post. This should help out.
(As a little kick in the junk, they also updated my RSS feed to the new feedburner feed I made, which I had done just to evade them in the first place! Anyway, I asked them to point at the one on my server instead.)
I fight this often. Most pull the feed right away, but if they don't, another way to get action is to post about it. "xxxxx.com is ripping me off" seems to work. Their aggregator automatically posts it on their own site. Then I send them the link to their own blog. It's good for a laugh anyway.