Friday, 16 September 2016

18 Reasons Your Website is Crawler-Unfriendly

You’ve been working really hard on your website and can’t wait to see it at the top of the search results, but your content is struggling to overcome the 10th page hurdle. If you are sure that your website deserves to be ranked higher, the problem might lie in your website’s crawlability.

What is crawlability? Search engines use search bots to collect certain parameters of website pages. The process of collecting this data is called crawling. Based on this data, search engines include pages in their search index, which means that those pages can be found by users. A website’s crawlability is its accessibility to search bots. You have to be sure that search bots will be able to find your website pages, obtain access and then “read” them.
The information about crawlability issues on the web is fragmented and sometimes contradictory. So we decided to list in one place all possible reasons why your website may be crawler-unfriendly.

We also break these issues down into two categories: those you can solve on your own and those you need to involve a developer or a system administrator in. Of course, all of us have different backgrounds and skills, so treat this categorization as tentative.

1. Blocking the page from indexing through robots meta tag
If you do this, the search bot will leave the page out of the index and move on to the next page.

You can detect this issue by checking if your page’s code contains this directive:
<meta name="robots" content="noindex" />
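
If you have many pages to check, the same test can be scripted. Below is a minimal sketch in Python, assuming you already have the page's HTML as a string; the names `RobotsMetaParser`, `robots_directives` and `is_noindex` are our own and not part of any SEO tool.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives present in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def is_noindex(html):
    """True if the page asks search bots not to index it."""
    return "noindex" in robots_directives(html)
```

Note that this only looks at the meta tag in the HTML; it does not cover directives sent through HTTP headers.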
2. Nofollow links
In this case the search bot will index your page’s content but will not follow the links. There are two types of nofollow directives:

  • For the whole page. Check if you have
<meta name="robots" content="nofollow">
in the page’s code; that would mean the crawler can’t follow any link on the page.
  • For a single link. In this case the nofollow value sits in the rel attribute of the link itself, for example:
<a href="/example-page" rel="nofollow">anchor text</a>
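
Both kinds of nofollow can be spotted with a short script. This is a rough sketch, assuming the page's HTML is available as a string; `NofollowLinkParser` and `find_nofollow` are illustrative names we made up for this example.

```python
from html.parser import HTMLParser


class NofollowLinkParser(HTMLParser):
    """Finds a page-level nofollow directive and individual nofollow links."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        attr_map = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr_map.get("name", "").lower() == "robots":
            directives = [d.strip().lower() for d in attr_map.get("content", "").split(",")]
            if "nofollow" in directives:
                self.page_nofollow = True
        elif tag == "a" and "href" in attr_map:
            # rel is a space-separated list, e.g. rel="noopener nofollow".
            if "nofollow" in attr_map.get("rel", "").lower().split():
                self.nofollow_links.append(attr_map["href"])


def find_nofollow(html):
    """Return (page_level_nofollow, list_of_nofollow_hrefs)."""
    parser = NofollowLinkParser()
    parser.feed(html)
    return parser.page_nofollow, parser.nofollow_links
```

If the first value is True, every link on the page is affected; otherwise only the links in the returned list are.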
3. Blocking the pages from indexing through robots.txt
Robots.txt is the first file of your website the crawlers look at. The most painful thing you can find there is:
User-agent: *
Disallow: /
It means that all the website’s pages are blocked from indexing.
It might happen that only certain pages or sections are blocked, for instance:
User-agent: *
Disallow: /products
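
To see which of your URLs a given robots.txt blocks, you can lean on the robots.txt parser in Python's standard library. This is a minimal sketch: `blocked_urls` is our own helper name and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


rules = "User-agent: *\nDisallow: /products\n"
pages = [
    "https://example.com/",
    "https://example.com/products/shoes",
    "https://example.com/blog",
]
print(blocked_urls(rules, pages))
```

The standard library parser follows the basic robots.txt rules; individual search engines may interpret some patterns (such as wildcards) differently, so treat the result as a first check.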

4. URL errors
A URL error is usually caused by a typo in a URL you insert into your page (text link, image link, form link). Be sure to check that all the links are typed in correctly.

5. Outdated URLs
If you have recently undergone a website migration, a bulk delete or a URL structure change, you need to double-check this issue. Make sure you don’t link to old or deleted URLs from any of your website’s pages.

6. Pages with denied access
If you see that many pages on your website return, for example, a 403 status code, it’s possible that these pages are accessible only to registered users. Mark these links as nofollow so that they don’t waste crawl budget.

7. Server errors
A large number of 5xx errors (for example 502 errors) may be a signal of server problems. To solve them, provide the list of pages with errors to the person responsible for the website’s development and maintenance. This person will take care of the bugs or website configuration issues causing the server errors.
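
Once you have a list of URLs with their status codes (from a crawl report, for instance), the link and server issues described above can be sorted into groups before you hand them over. The sketch below assumes the report is a plain dictionary of URL to status code; the group labels and function names are our own.

```python
def triage_status(status):
    """Map an HTTP status code to a rough issue category."""
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    if status in (401, 403):
        return "access denied"
    if 400 <= status < 500:
        return "broken link"
    if 500 <= status < 600:
        return "server error"
    return "unknown"


def group_by_issue(crawl_results):
    """Group URLs by issue category. crawl_results maps URL -> status code."""
    groups = {}
    for url, status in crawl_results.items():
        groups.setdefault(triage_status(status), []).append(url)
    return groups


# Hypothetical crawl report, for illustration only.
report = {"/": 200, "/old-page": 404, "/account": 403, "/search": 502}
print(group_by_issue(report))
```

The "server error" group is the list to pass on to whoever maintains the website.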

8. Limited server capacity
If your server is overloaded, it may stop responding to users’ and bots’ requests. When this happens, your visitors receive the “Connection timed out” message. This problem can only be solved together with the website maintenance specialist, who will estimate whether, and by how much, the server capacity should be increased.

9. Web server misconfiguration
This is a tricky issue. The site can be perfectly visible to you as a human, but it keeps giving an error message to a bot, so all the pages become unavailable for crawling. It can happen because of a specific server configuration: some web application firewalls (for example, Apache mod_security) block Googlebot and other search bots by default. In a nutshell, this problem, with all the related aspects, must be solved by a specialist.
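
Before calling in the specialist, you can get a first hint by requesting the same page with a browser-like and a bot-like User-Agent and comparing the answers. This is only a sketch under assumptions: `http_status` and `bot_is_blocked` are our own helper names, the browser User-Agent string is an arbitrary example, and a firewall may also filter by IP address, so a spoofed User-Agent only approximates what a real search bot sees.

```python
import urllib.error
import urllib.request

# Example User-Agent strings; the browser one is arbitrary.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
BOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def http_status(url, user_agent):
    """Fetch url with the given User-Agent and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx answers are raised as exceptions; keep the code.
        return error.code


def bot_is_blocked(url, fetch_status=http_status):
    """True if the browser-like request succeeds but the bot-like one fails."""
    browser_status = fetch_status(url, BROWSER_UA)
    bot_status = fetch_status(url, BOT_UA)
    return browser_status < 400 <= bot_status
```

The `fetch_status` parameter lets you swap in a different fetcher (or a fake one for testing) without touching the comparison logic.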
The sitemap, together with robots.txt, shapes the crawlers’ first impression of your site. A correct sitemap advises them to index your site the way you want it to be indexed. Let’s see what can go wrong when the search bot starts looking at your sitemap(s).

10. Format errors
There are several types of format errors, for example invalid URL or missing tags (see the complete list, along with a solution for each error, here).
You may also have found out (at the very first step) that the sitemap file is blocked by robots.txt. This means that the bots cannot access the sitemap’s content.

11. Wrong pages in sitemap
Let’s move on to the content. Even if you are not a web programmer, you can estimate the relevancy of the URLs in the sitemap. Take a close look at the URLs in your sitemap and make sure that each one of them is relevant, up to date and correct (no typos or misprints). If the crawl budget is limited and bots can’t go through the entire website, the sitemap indications can help them index the most valuable pages first.

Don’t mislead the bots with contradictory instructions: make sure that the URLs in your sitemap are not blocked from indexing by meta directives or robots.txt.
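
The robots.txt half of this check is easy to automate. Here is a minimal sketch, assuming a standard XML sitemap and the robots.txt content are both available as strings; `sitemap_urls` and `sitemap_conflicts` are names we invented for the example.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return every <loc> URL listed in a sitemap."""
    root = ET.fromstring(sitemap_xml)
    locations = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locations if loc.text]


def sitemap_conflicts(sitemap_xml, robots_txt, user_agent="*"):
    """Return sitemap URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls(sitemap_xml) if not parser.can_fetch(user_agent, url)]
```

Any URL this returns is being recommended by the sitemap and forbidden by robots.txt at the same time. Checking the meta directives would additionally require fetching each page.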

12. Bad internal linking
In a correctly optimized website structure all the pages form an unbroken chain, so that the crawler can easily reach every page.

In an unoptimized website certain pages drop out of the crawlers’ sight. There can be different reasons for it, which you can easily detect and categorize using the Site Audit tool by SEMrush:
  1. The page you want to get ranked is not linked to by any other page on the website. This way it has no chance of being found and indexed by search bots.
  2. Too many transitions between the main page and the page you want ranked. Common practice is to keep every page within 4 clicks of the main page; otherwise there’s a chance that the bot won’t reach it.
  3. More than 3000 active links on one page (too much work for the crawler).
  4. The links are hidden in unindexable site elements: forms that require submission, frames, plugins (Java and Flash first of all).
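
The first three checks can be sketched with a simple breadth-first walk over your internal links. The code below assumes you already have a map of each page to the pages it links to; the function names are ours, and the thresholds of 4 clicks and 3000 links are taken from the list above.

```python
from collections import deque


def click_depths(links, home):
    """Breadth-first search: number of clicks from home to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def audit_internal_links(links, home, max_depth=4, max_links=3000):
    """Report unreachable pages, pages that are too deep and overloaded pages."""
    depths = click_depths(links, home)
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return {
        "unreachable": sorted(all_pages - set(depths)),
        "too_deep": sorted(p for p, d in depths.items() if d > max_depth),
        "too_many_links": sorted(p for p, t in links.items() if len(t) > max_links),
    }


# Hypothetical link map: page -> pages it links to.
site = {"/": ["/a"], "/a": ["/b"], "/b": ["/c"], "/c": ["/d"], "/d": ["/e"], "/orphan": ["/"]}
print(audit_internal_links(site, "/"))
```

A page shows up as "unreachable" when no chain of links leads to it from the main page, which is exactly the situation in item 1.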
13. Wrong redirects
Redirects are necessary to forward users to a more relevant page (or, better, the one that the website owner considers relevant). Here’s what you can overlook when working with redirects:

Temporary redirect instead of permanent: Using 302 or 307 redirects is a signal to crawlers to come back to the page again and again, spending the crawl budget. So, if you know that the original page doesn’t need to be indexed anymore, use the 301 (permanent) redirect for it.
Redirect loop: It may happen that two pages get redirected to each other. The bot then gets caught in a loop and wastes the crawl budget. Double-check for mutual redirects and remove any you find.
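
Both problems can be found mechanically if you export your redirects as a table. This is a rough sketch, assuming a dictionary that maps each redirecting URL to its status code and target; `follow_redirects` and `temporary_redirects` are illustrative names.

```python
def follow_redirects(start, redirects, max_hops=10):
    """Follow a redirect chain from start. Returns (chain, is_loop).

    redirects maps URL -> (status_code, target_url).
    """
    chain = [start]
    seen = {start}
    current = start
    while current in redirects and len(chain) <= max_hops:
        _, target = redirects[current]
        chain.append(target)
        if target in seen:
            # We came back to a URL already visited: this is a loop.
            return chain, True
        seen.add(target)
        current = target
    return chain, False


def temporary_redirects(redirects):
    """Return URLs that use a temporary (302 or 307) redirect."""
    return sorted(url for url, (status, _) in redirects.items() if status in (302, 307))


# Hypothetical redirect table, for illustration only.
table = {"/old": (301, "/new"), "/a": (302, "/b"), "/b": (302, "/a")}
print(follow_redirects("/a", table))
print(temporary_redirects(table))
```

Here "/a" and "/b" redirect to each other, so the chain is flagged as a loop, and both also appear in the list of temporary redirects to review.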

14. Slow load speed
The faster your pages load, the quicker the crawler goes through them. Every split second is important. A website’s position in the SERP is also correlated with its load speed.

Use Google PageSpeed Insights to verify if your website is fast enough. If the load speed could deter users, there can be several factors affecting it.

Server side factors: your website may be slow for a simple reason – the current channel bandwidth is not sufficient anymore. You can check the bandwidth in your pricing plan description.

Front-end factors: one of the most frequent issues is unoptimized code. If it contains voluminous scripts and plug-ins, your site is at risk. Also don’t forget to verify on a regular basis that your images, videos and other similar content are optimized and don’t slow down the page’s load speed.

15. Page duplicates caused by poor website architecture

Duplicate content is the most frequent SEO issue, found in 50% of sites according to the recent SEMrush study "11 Most Common On-site SEO Issues." This is one of the main reasons you run out of crawl budget. Google dedicates a limited time to each website, so it’s wasteful to spend it on indexing the same content. Another problem is that the crawlers don’t know which copy to trust more and may give priority to the wrong pages, unless you use canonicals to clear things up.

To fix this issue you need to identify duplicate pages and prevent their crawling in one of the following ways:
  • Delete duplicate pages
  • Set necessary parameters in robots.txt
  • Set necessary parameters in meta tags
  • Set a 301 redirect
  • Use rel=canonical
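
Identifying exact duplicates is the part you can script. The sketch below assumes you have already extracted the main text of each page into a dictionary; it only catches pages whose text is identical after normalizing whitespace and case, not near-duplicates, and the function names are our own.

```python
import hashlib
import re


def content_fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Return groups of URLs whose text is identical. pages maps URL -> text."""
    groups = {}
    for url, text in pages.items():
        groups.setdefault(content_fingerprint(text), []).append(url)
    return sorted(sorted(urls) for urls in groups.values() if len(urls) > 1)


# Hypothetical pages, for illustration only.
pages = {
    "/shoes": "Red shoes  on sale",
    "/shoes?sort=price": "red shoes on sale\n",
    "/about": "About us",
}
print(find_duplicates(pages))
```

Each group it returns is a set of URLs to resolve with one of the options listed above.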
16. JS and CSS usage
Back in 2015 Google officially stated: “As long as you're not blocking Googlebot from crawling your JavaScript or CSS files, we are generally able to render and understand your web pages like modern browsers.” This doesn’t necessarily apply to other search engines (Yahoo, Bing, etc.), though. Moreover, “generally” means that in some cases correct indexing is not guaranteed.

17. Flash content
Using Flash is a slippery slope both for user experience (Flash files are not supported on some mobile devices) and SEO. Text content or links inside a Flash element are unlikely to be indexed by crawlers.
So we suggest you simply don’t use it on your website.

18. HTML frames
If your site contains frames, there’s both good and bad news. The good news is that this probably means your site is mature enough. The bad news is that HTML frames are extremely outdated and poorly indexed, and you need to replace them with a more up-to-date solution as fast as possible.

Delegate Daily Grind, Focus on Action

It’s not necessarily wrong keywords or content-related issues that keep you floating under Google’s radar. A perfectly optimized page is no guarantee that you will get it ranked at the top (or ranked at all) if the content can’t be delivered to the engine because of crawlability problems.
