Preventing duplicate content issues
Before reading this article, please be aware that Digital Online Marketing Ltd cannot be held accountable for the loss of any data or the down-time of any website. This article should be followed thoroughly and only performed if completely comfortable with editing a .htaccess file.
A common problem for website owners is duplicate content caused by a site's index files. For example, search engines can treat all of the following as different pages with identical content:
- www.yourdomain.com/page/
- www.yourdomain.com/page/index.html
- yourdomain.com/page/
- yourdomain.com/page/index.htm
Rel=Canonical Will Not Do
There’s no denying that the rel=canonical tag is fantastic for telling search engines which version of a page they should index, but realistically it’s not a real solution. It’s just a tool that stops Google from penalising you. You still need to get to the root of the problem.
WWW or non-WWW
Choosing a URL format is always a tough decision in any business. A good fallback is to look at all of your corporate stationery and see which URL is printed on it. If it always includes “www”, then you need to make sure that your domain always resolves to “www.yourdomain.com”.
Automatically adding “www”
Adding “www” to your domain is really easy. All you need to do is open up the .htaccess file in the root folder of your hosting account and add the following few lines of code:
RewriteEngine on
RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [L,R=301]
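If you would rather not hard-code the domain name, the same idea can be written in a host-agnostic way. This is only a sketch, and it makes two assumptions that are not part of the rule above: the site is served over plain http, and you have no subdomains (such as “blog.yourdomain.com”) that should be left alone, because this version adds “www.” to any hostname that lacks it.

```apache
RewriteEngine on
# Match any requested host that does not already begin with "www."
RewriteCond %{HTTP_HOST} !^www\. [NC]
# Redirect to the same host with "www." in front, keeping the requested path
RewriteRule ^ http://www.%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
```

Use either this version or the hard-coded one, not both.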
Automatically removing “www”
Sometimes it looks much nicer to automatically remove the “www”, particularly when you have a short, easy-to-remember URL. Websites like bit.ly are a great example of this. Here’s how to automatically remove the “www”.
RewriteEngine on
RewriteCond %{HTTP_HOST} ^www\.yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://yourdomain.com/$1 [L,R=301]
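A variation that avoids typing the domain name captures whatever follows “www.” in the requested host and redirects to it. As before, treat this as a sketch that assumes the site is served over plain http:

```apache
RewriteEngine on
# Capture everything after the leading "www." in the requested host
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
# %1 is the text captured by the RewriteCond above; the path is kept as requested
RewriteRule ^ http://%1%{REQUEST_URI} [L,R=301]
```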
What does this mean?
This means that you’re already part of the way to preventing one of the main duplicate content errors. Every page on the website will now be reachable on only one version of the domain, either “www” or “non-www”, never both.
Don’t forget to set your preferred domain in Webmasters
Google Webmaster Tools allows users to specify which version of the domain it should show in the SERPs. It’s doubtful that this has any effect on indexation, but it’s best to be consistent. So open up Webmaster Tools, click on your domain, press “site configuration” followed by “settings”, then change the “preferred domain”.
A solution to a difficult problem with the index.* files
One problem webmasters come up against time and time again is indexation of the index files on their website. If you don’t know what an index file is, it is the file the server automatically displays as the “home” of the website or of a folder. Certain servers are configured to use “home.htm”, for example, but most of the time it is “index.*”.
The problem was outlined at the top of this post and occurs when more than one URL is indexed for what is really the same page. It can also be caused when a new website is created.
It is such a problem because it’s not as simple as using a basic 301 redirect in the .htaccess file. When a visitor requests a folder, the server internally serves the index file, so a basic redirect from the index file back to the folder fires again and again. This causes a redirect loop, and your website will be inaccessible until you remove the 301 redirect.
The solution:
It’s another three lines, but it’s not something that’s easy for a .htaccess novice. These lines rely on the “RewriteEngine on” line from the rules above already being in the file. Here it is:
Options +FollowSymLinks
RewriteCond %{THE_REQUEST} ^.*/index\.html
RewriteRule ^(.*)index\.html$ http://www.yourdomain.com/$1 [R=301,L]
Obviously, make sure you change “.html” above to whatever the filetype of the index file is.
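The reason this avoids the redirect loop is that THE_REQUEST holds the original request line sent by the browser (for example “GET /page/index.html HTTP/1.1”), which does not change when the server internally serves the index file for a folder. The rule therefore only fires when a visitor literally asked for the index file. Putting the pieces together, a complete .htaccess might look something like the sketch below. It assumes “www.yourdomain.com” is your preferred domain, that the site is served over plain http, and that your index files end in “.html” or “.htm”. The stricter THE_REQUEST pattern is my own variation on the one above, not a requirement.

```apache
Options +FollowSymLinks
RewriteEngine on

# 1. Send the bare domain to the "www" version
RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [L,R=301]

# 2. Redirect explicit requests for index.html or index.htm to the folder URL.
#    The condition checks the raw request line and stops at "?" so that
#    "index.html" appearing inside a query string does not trigger it.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/([^?\s]*/)?index\.html?[?\s]
RewriteRule ^(.*)index\.html?$ http://www.yourdomain.com/$1 [R=301,L]
```

After uploading, request each of the four example URLs from the top of this article and check that they all end up at the same address.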
Written by Jason John Mills

