Crawler Traps, how to prevent and fix them
Crawler traps make it difficult or even impossible for a crawler to crawl your website efficiently. Crawler traps hurt the crawl process and make it difficult for your website to rank.
What are crawler traps
Crawler traps, or spider traps, is a technical term for an issue with the structure of a website. Crawler traps generate a never-ending number of urls that a spider cannot possibly crawl. As a result, the spider gets stuck in the trap and never reaches the 'good' parts of your website.
Good practice tips and tricks to prevent crawler traps
With crawler traps, prevention is better than cure. Crawler traps usually originate from a technical design flaw. Fix the design flaw instead of trying to cover up the issue.
- Block duplicate pages in your robots.txt file
- A correct canonical url will prevent duplicate content but not crawl budget issues
- Adding nofollow to links stops the link from passing PageRank. It does not prevent crawler traps
What is the issue with spider traps?
Spider traps may slow down the discovery of important new pages and changes, and cause issues with the quality and structure of a website.
1. Crawl budget issues
Each website receives a crawl budget from Google. A crawl budget is the number of requests (note: this is not the same as the number of pages!) Google is willing to make to your website. When your crawl budget is wasted on irrelevant pages, there might not be enough budget left to quickly discover new, exciting content and recent changes to your site.
Googlebot is able to detect most spider traps. Once a spider trap is detected, Google will stop crawling the trap and lower the crawl frequency of those pages. However, detecting a crawl trap may take Google some time, and after detection crawl budget is still being wasted on the spider trap, only less than before.
2. Quality issues
Most crawler traps consist of infinite loops of the same page(s). Each page is basically the same as the previous page. This causes duplicate content issues. Duplicate content is a sign of a low quality website. Googlebot is able to detect and filter duplicate content pages. However, this process takes time and is not infallible. If only 0.00001% of an infinite number of pages are not flagged by Google, this still causes serious issues.
Common crawler traps: how to identify and fix them
These are the most common crawler traps. We will explain how to identify and fix each of them.
- https / subdomain redirect trap
- Filter trap
- Never-Ending URL trap
- Time trap
- Infinite redirect trap
- Session url trap
1. https / subdomain redirect trap
This is actually the most common crawler trap that we come across. A site is running on a secure https connection and every page of the old 'non-secured' version is redirected to the secured version of the homepage.
http://www.example.com/page1 redirects to https://www.example.com
The problem with this redirect
The problem with this redirect is that search engine spiders like Googlebot never completely figure out where the old non-secured pages have gone. In the example above, http://www.example.com/page1 should have been redirected to https://www.example.com/page1. Instead it is redirected to the homepage. Most crawlers will identify this as an incorrect redirect. The old page is not updated to the new location but rather labelled as a soft 404. Googlebot will keep retrying to crawl this page, causing your site to leak your precious crawl budget.
The same issue occurs when a request to example.com/page1 redirects to www.example.com/. Note that the first request has no 'www'.
How to detect the https / subdomain redirect trap
This problem is not hard to detect manually. However, this is the kind of issue that sneaks up on you. After each server maintenance, CMS update or server update you should recheck that the redirect is still correct. Check your server logs for http requests on your https website and filter on crawlers. You could also check this by manually changing https:// to http:// on a page of your website and verifying that you end up on the same page. The MarketingTracer on-page SEO crawler is built to detect this crawl trap. We will notify you of incorrect redirects when we detect this issue.
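The manual check can be expressed as a small comparison: a redirect is suspicious when the path or query it lands on differs from the one that was requested. The sketch below is illustrative only; the function name redirect_keeps_path is hypothetical, and it assumes you have already obtained the redirect target (for example from the Location header or from your server logs).

```python
from urllib.parse import urlsplit


def redirect_keeps_path(source_url: str, target_url: str) -> bool:
    """Return True if a redirect lands on the same path and query it started from."""
    source = urlsplit(source_url)
    target = urlsplit(target_url)
    # Treat an empty path and "/" as the same location (the homepage).
    source_path = source.path or "/"
    target_path = target.path or "/"
    return source_path == target_path and source.query == target.query


# The redirect from the example: /page1 is sent to the homepage (the trap).
print(redirect_keeps_path("http://www.example.com/page1", "https://www.example.com"))
# The intended redirect: scheme changes, path stays the same.
print(redirect_keeps_path("http://www.example.com/page1", "https://www.example.com/page1"))
```

The first call reports a redirect that drops the path, the second one a redirect that keeps it. Both arguments must be full urls including the scheme.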
How to fix the https / subdomain redirect trap
The source of this issue is a misconfiguration of your webserver or CMS. Depending on what causes the redirect, you should edit your webserver configuration or your CMS so that the request uri is added to the redirect string.
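What 'adding the request uri to the redirect string' means can be sketched in a few lines. This is not a webserver or CMS configuration, only an illustration of the logic such a configuration should implement. CANONICAL_SCHEME, CANONICAL_HOST and canonical_redirect_target are names made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_SCHEME = "https"          # assumption: the site is served over https
CANONICAL_HOST = "www.example.com"  # assumption: the preferred hostname includes 'www'


def canonical_redirect_target(requested_url: str) -> str:
    """Build the redirect target, keeping the requested path and query string."""
    parts = urlsplit(requested_url)  # expects a full url including the scheme
    return urlunsplit(
        (CANONICAL_SCHEME, CANONICAL_HOST, parts.path or "/", parts.query, "")
    )


print(canonical_redirect_target("http://www.example.com/page1"))
print(canonical_redirect_target("http://example.com/page1?ref=mail"))
```

Only the scheme and hostname are replaced; everything after the hostname is carried over, so every old url maps to its own new location instead of the homepage.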
2. Filter trap
Filters for products and sorting can generate a huge number of urls. For example, sorting on price, popularity and filtering on size (s,m,l,xl,xxl) and color (8 colors) will generate 2*2*2*6*8 = 384 pages of duplicate content. Now multiply this by all your shop categories and any other filters you might use.
Usually we will advise you to avoid using query parameters (?sort=price) in your urls. But on a shopping page sorting and filtering are a must. That is why we have to tackle this issue a little differently.
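The growth is multiplicative: every extra filter multiplies the number of possible urls. The sketch below first reproduces the arithmetic from the text and then enumerates the urls for a smaller, made-up filter set. The parameter names, values and base url are assumptions for illustration, not taken from a real shop.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode

# The factors used in the text: 2 * 2 * 2 * 6 * 8.
TEXT_FACTORS = [2, 2, 2, 6, 8]
print(prod(TEXT_FACTORS))  # 384

# Hypothetical filters for a single category page.
FILTERS = {
    "sort": ["price", "popularity"],
    "size": ["s", "m", "l", "xl", "xxl"],
    "color": ["red", "blue", "green"],
}


def filter_urls(base_url: str, filters: dict) -> list:
    """Return one url for every combination of filter values."""
    names = list(filters)
    return [
        base_url + "?" + urlencode(dict(zip(names, combo)))
        for combo in product(*filters.values())
    ]


urls = filter_urls("https://www.example.com/shoes", FILTERS)
print(len(urls))  # 2 * 5 * 3 = 30 urls for one category
print(urls[0])
```

Three small filters already produce 30 urls for a single category page, all showing largely the same products.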
The problem with the filter trap
Adding noindex tags, nofollow to links or canonicals to your pages will not prevent Google from trying to crawl all your filtered pages.
How to detect the filter trap
When your site uses filters you are almost certainly vulnerable to the filter trap. It's not a matter of yes or no but to which degree.
How to fix the filter trap
The best way to prevent the filter trap is to block filter results from Google. First add a canonical url to your shop page indicating the 'correct' location for your shop/category/product page. Then add the filters to your robots.txt file like this:
Disallow: /*?*size=
Disallow: /*?*sort=
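Before deploying rules like these it helps to test which urls they actually match. The sketch below is a minimal matcher written for illustration, assuming the wildcard semantics Google documents for robots.txt ('*' matches any sequence of characters, '$' anchors the end of the url). The function names are hypothetical and this is not a full robots.txt parser.

```python
import re
from typing import Iterable

DISALLOW_RULES = ["/*?*size=", "/*?*sort="]  # the two rules from the example


def robots_pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern with * and $ into a regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything, then turn the escaped '*' back into 'match anything'.
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path_and_query: str, rules: Iterable[str]) -> bool:
    """Return True if any rule matches the path (including the query string)."""
    return any(robots_pattern_to_regex(rule).match(path_and_query) for rule in rules)


for candidate in ["/shop/shoes", "/shop/shoes?size=m", "/shop/shoes?color=red&sort=price"]:
    print(candidate, is_disallowed(candidate, DISALLOW_RULES))
```

Under these semantics the first pattern also matches any parameter whose name merely ends in 'size' (for example ?oversize=1), so it is worth running your real urls through a check like this.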
3. Never-Ending URL trap
The never ending url trap occurs with a relative link to the wrong directory level. Instead of linking to '/page1' you link to 'page1' (note the slash in front of the first link).
<a href="page1">Page 1</a>
Repeatedly clicking this link will navigate you to
https://www.example.com/page1
https://www.example.com/page1/page1
https://www.example.com/page1/page1/page1
The problem with the never ending url trap
The never ending url trap quickly generates an infinite number of urls. The never ending url trap is hard to detect because Google will almost never show these urls in the site command. Google does, however, keep trying to crawl the never ending urls at a slow pace.
How to detect the never ending url trap
The never ending url trap is difficult to detect manually. You will need to inspect the source of your page to detect the small omission of '/' in your link.
The MarketingTracer on-page SEO crawler is built to detect this crawl trap. Just check our crawl index and sort your pages by url. You will quickly be able to find the mistake. From there, analyse the page to view all the links to this page and fix them.
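If you have a list of crawled urls (from your server logs or any crawler export), the trap shows up as the same path segment repeated back to back. The sketch below is one possible heuristic, not the method used by any particular tool. has_repeated_segments and the min_repeats threshold are assumptions for this example.

```python
from urllib.parse import urlsplit


def has_repeated_segments(url: str, min_repeats: int = 3) -> bool:
    """Return True if a path segment occurs at least `min_repeats` times in a row."""
    segments = [segment for segment in urlsplit(url).path.split("/") if segment]
    run = 1  # length of the current run of identical segments
    for previous, current in zip(segments, segments[1:]):
        run = run + 1 if current == previous else 1
        if run >= min_repeats:
            return True
    return False


print(has_repeated_segments("https://www.example.com/page1"))
print(has_repeated_segments("https://www.example.com/page1/page1/page1"))
```

A threshold of three keeps legitimate urls such as /blog/blog out of the results. Lower it to two if you prefer to review those by hand.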
How to fix the never ending url trap
The never ending url trap is easy to fix. Locate the relative link and replace it with a root-relative link that starts with a slash (replace <a href="page1">Page 1</a> with <a href="/page1">Page 1</a>).
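Why the leading slash matters can be reproduced with standard url resolution. The sketch below follows the same link a few times, once as 'page1' and once as '/page1'. It assumes that the server answers every path and serves pages with a trailing slash, which is what makes each relative link resolve one directory level deeper. simulate_clicks is a name made up for this example.

```python
from urllib.parse import urljoin


def simulate_clicks(start_url: str, href: str, clicks: int) -> list:
    """Resolve `href` repeatedly, the way a crawler follows the same link on each new page."""
    visited = []
    current = start_url
    for _ in range(clicks):
        current = urljoin(current, href)
        # Assumption: the server serves every page with a trailing slash,
        # so the next relative link resolves one directory level deeper.
        if not current.endswith("/"):
            current += "/"
        visited.append(current)
    return visited


print(simulate_clicks("https://www.example.com/", "page1", 3))   # keeps growing
print(simulate_clicks("https://www.example.com/", "/page1", 3))  # stays on one url
```

With the relative link each click yields a new, longer url. With the leading slash every click resolves to the same page.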
4. Time trap
Your calendar plugin can generate pages infinitely into the future. The time trap is also referred to as the calendar trap.
https://www.example.com/calendar/2019/01 // month 1
https://www.example.com/calendar/2019/02 // month 2
...
https://www.example.com/calendar/3019/01 // 1000 years into the future
The problem with the time trap
The time trap generates an unlimited number of empty pages. While Google is pretty good at avoiding the time trap, it takes a while for Google to learn this for your site. In the meantime lots and lots of low quality pages will get crawled by Google.
How to detect the time trap
This problem is a bit harder to detect manually. The site command (site:www.example.com/calendar) will give you an indication of the indexed pages of your calendar. However, once Google has detected the time trap it will quickly remove all irrelevant calendar pages from the index, rendering the site command useless.
A manual inspection of your calendar plugin is the only way to check for this trap. First inspect your settings (are there options to avoid the time trap, like limiting the number of months into the future?). If not, then check if the calendar pages in the distant future come with robots instructions (like <meta name="robots" content="noindex">).
The MarketingTracer on-page SEO crawler is built to detect this crawl trap. Just check our crawl index and filter on 'calendar' (or if your calendar plugin has a different name, use that name).
How to fix the time trap
Fixing the time trap can be difficult since calendar software usually comes as a plugin. If the plugin does not have sufficient protection against the time trap, you will need to block crawlers from the calendar pages in your robots.txt.
- Set the number of pages in the future to a reasonable amount
- No-following the links will NOT fix the issue
- Block the calendar pages in your robots.txt
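If you can change the calendar code yourself, the first tip comes down to a simple bounds check before a calendar page is rendered or linked. The sketch below is an assumption about how such a check could look. MAX_MONTHS_AHEAD, its value of 24 and the function names are made up for illustration and do not refer to any specific plugin.

```python
from datetime import date

MAX_MONTHS_AHEAD = 24  # assumption: two years ahead is a 'reasonable amount'


def months_ahead(today: date, year: int, month: int) -> int:
    """Number of months between the current month and the requested month."""
    return (year - today.year) * 12 + (month - today.month)


def calendar_page_allowed(year: int, month: int, today: date,
                          max_months_ahead: int = MAX_MONTHS_AHEAD) -> bool:
    """Return True if the requested calendar month is within the allowed future window."""
    if not 1 <= month <= 12:
        return False
    return months_ahead(today, year, month) <= max_months_ahead


today = date(2019, 1, 15)
print(calendar_page_allowed(2019, 2, today))  # next month
print(calendar_page_allowed(3019, 1, today))  # 1000 years into the future
```

Months outside the window should neither be linked nor served as a normal page (a 404 is one option). This sketch only limits the future; a calendar that also pages endlessly into the past needs the same check in the other direction.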
5. Infinite redirect trap
https://www.example.com/page2 redirects to https://www.example.com/page2
The problem with the infinite redirect trap
Google understands infinite redirects and will stop crawling after it detects a loop. There are still 2 issues with infinite redirects. 1. They eat away at your crawl budget. 2. Internal links to infinite redirects are a sign of poor quality.
How to detect the infinite redirect trap
Infinite redirects will give a 'redirect loop' error in your browser.
Infinite redirects are almost impossible to detect when they are tucked away somewhere deep in your website.
The MarketingTracer on-page SEO crawler is built to detect this crawl trap. Use the redirect filter to view these redirect loops.
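If you have an export of your redirects (source url to target url), loops can also be found offline by following each chain and remembering the urls already seen. The sketch below works on a plain dictionary. The redirect map and the name find_redirect_loop are hypothetical, and no network requests are made.

```python
from typing import Dict, List, Optional


def find_redirect_loop(redirects: Dict[str, str], start_url: str) -> Optional[List[str]]:
    """Follow redirects from start_url; return the chain if it loops, else None."""
    chain = [start_url]
    seen = {start_url}
    current = start_url
    while True:
        target = redirects.get(current)
        if target is None:
            return None  # the chain ends at a page that does not redirect
        chain.append(target)
        if target in seen:
            return chain  # we are back at a url we already visited
        seen.add(target)
        current = target


# Hypothetical redirect map containing the self-redirect from the example.
redirect_map = {
    "https://www.example.com/page2": "https://www.example.com/page2",
    "https://www.example.com/old": "https://www.example.com/new",
}
print(find_redirect_loop(redirect_map, "https://www.example.com/page2"))
print(find_redirect_loop(redirect_map, "https://www.example.com/old"))
```

Because every step either ends the chain or adds a new url to the set of seen urls, the function always terminates on a finite redirect map, including for loops that span several pages.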
How to fix the infinite redirect trap
Fixing the infinite redirect loop is easy. Just redirect the page to the correct page and you are done.
6. Session url trap
The problem with the session url trap
Most frameworks use sessions. Sessions are used to store visitor data for this visit only. Each session usually gets a unique id (12345abcde for example). Session data is normally stored in cookies. If for some reason, like a misconfiguration of the server, the session data is not stored in a cookie, the session id might get added to the url.
Each visit from a crawler counts as a 'new visit' and gets a new session id. The same url, crawled twice, will get 2 different session ids and 2 different urls. Each time a crawler crawls a page, all the links with the new session id will look like new pages, resulting in an explosion of urls ready to crawl.
How to detect the session url trap
Detecting the session url trap is easy. Just visit your website, disable cookies and click a few links. If a session id appears in the url you are vulnerable to the session url trap.
The MarketingTracer on-page SEO crawler is built to detect this crawl trap. Just check our crawl index and filter on 'session' and we will show you all the urls with session ids.
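The same check can be run over a list of urls from your logs. The sketch below flags urls that carry a session id as a query parameter and shows that two urls differing only in their session id point to the same page. The parameter names in SESSION_PARAMS are an assumption: use whatever name your framework writes into the url. Session ids embedded in the path (such as ;jsessionid=...) are not handled here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: common session parameter names; adjust to your own framework.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sessionid", "sid"}


def has_session_id(url: str) -> bool:
    """Return True if the query string contains a known session parameter."""
    query = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return any(name.lower() in SESSION_PARAMS for name, _ in query)


def strip_session_id(url: str) -> str:
    """Return the url without session parameters, keeping all other parameters."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))


first_visit = "https://www.example.com/page1?color=red&sid=12345abcde"
second_visit = "https://www.example.com/page1?color=red&sid=67890fghij"
print(has_session_id(first_visit))
print(strip_session_id(first_visit) == strip_session_id(second_visit))
```

To a crawler the two visits look like two different urls; once the session parameter is removed they collapse into one.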
How to fix the session url trap
Fixing the session trap is relatively easy. Usually a setting in your CMS will disable session ids in the url. Alternatively, you will need to reconfigure your webserver.