Web crawlers (also called spiders or bots) are programs that visit (or “crawl”) pages across the web.
And search engines use crawlers to discover content that they can then index—meaning store in their enormous databases.
These programs discover your content by following links on your site.
But the process doesn’t always go smoothly because of crawl errors.
Before we dive into these errors and how to address them, let’s start with the basics.
What Are Crawl Errors?
Crawl errors occur when search engine crawlers can’t navigate through your webpages the way they normally do.
When this occurs, search engines like Google can’t fully explore and understand your website’s content or structure.
This is a problem because crawl errors can prevent your pages from being discovered. Which means they can’t be indexed, appear in search results, or drive organic (unpaid) traffic to your site.
Google separates crawl errors into two categories: site errors and URL errors.
Let’s explore both.
Site Errors
Site errors are crawl errors that can impact your whole website.
Server, DNS, and robots.txt errors are the most common.
Server Errors
Server errors (which return a 5xx HTTP status code) happen when a problem on the server prevents the page from loading.
Here are the most common server errors:
- Internal server error (500): The server can’t complete the request due to an unexpected condition. It’s also the code returned when a more specific 5xx error doesn’t apply.
- Bad gateway error (502): One server acts as a gateway and receives an invalid response from another server
- Service unavailable error (503): The server is currently unavailable, usually because it’s down for maintenance or being updated
- Gateway timeout error (504): One server acts as a gateway and doesn’t receive a response from another server in time. Like when there’s too much traffic on the website.
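If you want to confirm which status code a page returns, you can check it yourself from the command line. Here’s a minimal sketch using curl (the URL is just a placeholder):

# Print only the HTTP status code for the requested URL
curl -s -o /dev/null -w "%{http_code}\n" https://www.yoursite.com/some-page/

If the output falls in the 500-504 range, the page is returning one of the server errors above.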
When search engines repeatedly encounter 5xx errors, they can reduce how often they crawl your website.
That means search engines like Google might be unable to discover and index all your content.
DNS Errors
A domain name system (DNS) error occurs when search engines can’t connect to your domain.
All websites and devices have at least one internet protocol (IP) address uniquely identifying them on the web.
The DNS makes it easier for people and computers to talk to each other by matching domain names to their IP addresses.
Without DNS, we would have to manually enter a website’s IP address instead of typing its URL.
So, instead of entering “www.semrush.com” in your URL bar, you would have to use our IP address: “34.120.45.191.”
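You can see this mapping for yourself with a quick DNS lookup. Here’s a minimal sketch using nslookup (it works with any domain):

# Ask your configured DNS server which IP address the domain resolves to
nslookup www.semrush.com

If the lookup times out or reports that the domain can’t be found, you’re seeing the same kind of failure a crawler runs into when it hits a DNS error.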
DNS errors are less common than server errors. But here are the ones you might encounter:
- DNS timeout: Your DNS server didn’t reply to the search engine’s request in time
- DNS lookup: The search engine couldn’t reach your website because your DNS server failed to locate your domain name
Robots.txt Errors
Robots.txt errors arise when search engines can’t retrieve your robots.txt file.
Your robots.txt file tells search engines which pages they can crawl and which they can’t.
Here’s what a basic robots.txt file looks like (see the example after the list below).
Here are the three main parts of this file and what each does:
- User-agent: This line identifies the crawler. And “*” means that the rules are for all search engine bots.
- Disallow/Allow: These lines tell search engine bots which parts of your website they can and can’t crawl
- Sitemap: This line indicates your sitemap location
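To make those three parts concrete, here’s a simplified sketch of a robots.txt file (the disallowed path and sitemap URL are placeholders, not recommendations):

# Rules apply to all crawlers
User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://www.yoursite.com/sitemap.xml

In this sketch, every bot is told to stay out of the /admin/ section, crawl everything else, and find the sitemap at the listed URL.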
URL Errors
Unlike site errors, URL errors only affect the crawlability of specific pages on your site.
Here’s an overview of the different types:
404 Errors
A 404 error means that the search engine bot couldn’t find the URL. And it’s one of the most common URL errors.
It happens when:
- You’ve changed the URL of a page without updating old links pointing to it
- You’ve deleted a page or article from your site without adding a redirect
- You have broken links (e.g., there are errors in the URL)
Here’s what a basic 404 page looks like on an Nginx server.
But most companies use custom 404 pages today.
These custom pages improve the user experience. And allow you to remain consistent with your website’s design and branding.
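If your site runs on an Nginx server like the one in the example above, serving a custom 404 page is usually a small configuration change. Here’s a minimal sketch (the file name is an assumption for illustration):

# Serve a branded error page whenever a URL returns a 404
error_page 404 /custom-404.html;

location = /custom-404.html {
    internal;
}

The error_page directive tells Nginx which page to show when a URL can’t be found, and internal prevents visitors from requesting the error page directly.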
Soft 404 Errors
Soft 404 errors happen when the server returns a 200 code but Google thinks it should be a 404 error.
The 200 code means everything is OK. It’s the expected HTTP response code if there are no issues.
So, what causes soft 404 errors?
- JavaScript file issue: The JavaScript resource is blocked or can’t be loaded
- Thin content: The page has insufficient content that doesn’t provide enough value to the user. Like an empty internal search result page.
- Low-quality or duplicate content: The page isn’t useful to users or is a copy of another page. For example, placeholder pages that shouldn’t be live like those that contain “lorem ipsum” content. Or duplicate content that doesn’t use canonical URLs—which inform search engines which page is the primary one.
- Other reasons: Missing files on the server or a broken connection to your database
Here’s what you see in Google Search Console (GSC) when it reports pages with soft 404 errors.
403 Forbidden Errors
The 403 forbidden error means the server denied a crawler’s request. Meaning the server understood the request, but the crawler isn’t able to access the URL.
Here’s what a 403 forbidden error looks like on an Nginx server.
Problems with server permissions are the main reason behind 403 errors.
Server permissions define users’ and admins’ rights to a folder or file.
We can divide the permissions into three categories: read, write, and execute.
For example, you won’t be able to access a URL if you don’t have the read permission.
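On a Linux server, you can inspect and adjust those permissions from the command line. Here’s a minimal sketch (the file path is a placeholder):

# Show the current permissions for a file
ls -l /var/www/html/index.html

# 644 = owner can read and write; everyone else can only read
chmod 644 /var/www/html/index.html

If the web server’s user can’t read a file (or traverse the folder it sits in), requests for that URL can come back as 403 errors.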
A faulty .htaccess file is another recurring cause of 403 errors.
An .htaccess file is a configuration file used on Apache servers. It’s helpful for adjusting server settings and implementing redirects.
But any error in your .htaccess file can result in issues like a 403 error.
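For example, a single overly broad rule can lock everyone out, crawlers included. Here’s a sketch of what such a misconfiguration might look like in an .htaccess file on Apache 2.4 (for illustration only; don’t add this to a live site):

# Blocks every request to this directory, including search engine bots
Require all denied

Removing a rule like this, or scoping it to the directories that genuinely need protection, usually clears up the resulting 403 errors.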
Redirect Loops
A redirect loop happens when page A redirects to page B. And page B to page A.
The result?
An infinite loop of redirects that prevents visitors and crawlers from accessing your content. Which can hinder your rankings.
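In .htaccess terms, a loop can be as simple as two redirect rules pointing at each other. Here’s a sketch of that misconfiguration (the URLs are placeholders, for illustration only):

# Page A sends visitors to page B...
Redirect 301 /page-a/ https://www.yoursite.com/page-b/
# ...and page B sends them straight back to page A
Redirect 301 /page-b/ https://www.yoursite.com/page-a/

Browsers eventually give up with a “too many redirects” error, and crawlers abandon the chain too. The fix is to remove one side of the loop and point both old URLs at a single final destination.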
How to Find Crawl Errors
Site Audit
Semrush’s Site Audit allows you to easily discover issues affecting your site’s crawlability. And provides suggestions on how to address them.
Open the tool, enter your domain name, and click “Start Audit.”
Then, follow the Site Audit configuration guide to adjust your settings. And click “Start Site Audit.”
You’ll be taken to the “Overview” report.
Click on “View details” in the “Crawlability” module under “Thematic Reports.”
You’ll get an overall understanding of how you’re doing in terms of crawl errors.
Then, select a specific error you want to solve. And click on the corresponding bar next to it in the “Crawl Budget Waste” module.
We’ve chosen 4xx errors for our example.
On the next screen, click “Why and how to fix it.”
You’ll get information required to understand the issue. And advice on how to solve it.
Google Search Console
Google Search Console is also an excellent tool offering valuable help to identify crawl errors.
Head to your GSC account and click on “Settings” on the left sidebar.
Then, click on “OPEN REPORT” next to the “Crawl stats” tab.
Scroll down to see if Google noticed crawling issues on your site.
Click on any issue, like the 5xx server errors.
You’ll see the full list of URLs matching the error you selected.
Now, you can address them one by one.
How to Fix Crawl Errors
We now know how to identify crawl errors.
The next step is better understanding how to fix them.
Fixing 404 Errors
You’ll probably encounter 404 errors frequently. And the good news is they’re easy to fix.
You can use redirects to fix 404 errors.
Use 301 redirects when the change is permanent, because they allow the new page to retain some of the original page’s authority. And use 302 redirects when the change is only temporary.
How do you choose the destination URL for your redirects?
Here are some best practices:
- Add a redirect to the new URL if the content still exists
- Add a redirect to a page addressing the same or a highly similar topic if the content no longer exists
There are three main ways to deploy redirects.
The first method is to use a plugin.
If your site runs on WordPress, several popular redirect plugins can handle this for you.
The second method is to add redirects directly in your server configuration file.
Here’s what a 301 redirect would look like in an .htaccess file on an Apache server.
Redirect 301 https://www.yoursite.com/old-page/ https://www.yoursite.com/new-page/
You can break this line down into four parts:
- Redirect: Specifies that we want to redirect the traffic
- 301: Indicates the redirect code, stating that it’s a permanent redirect
- https://www.yoursite.com/old-page/: Identifies the URL to redirect from
- https://www.yoursite.com/new-page/: Identifies the URL to redirect to
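If your server runs Nginx instead of Apache, the equivalent change goes in your Nginx configuration rather than an .htaccess file. Here’s a minimal sketch (the URLs are placeholders):

# Permanently redirect the old page to the new one
location = /old-page/ {
    return 301 https://www.yoursite.com/new-page/;
}

Either way, the effect is the same: visitors and crawlers requesting the old URL are sent to the new one with a permanent (301) redirect.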
We don’t recommend this option if you’re a beginner. Because it can negatively impact your site if you’re unsure of what you’re doing. So, make sure to work with a developer if you opt to go this route.
Finally, you can add redirects directly from the backend if you use Wix or Shopify.
If you’re using Wix, scroll to the bottom of your website control panel. Then click on “SEO” under “Marketing & SEO.”
Click “Go to URL Redirect Manager” located under the “Tools and settings” section.
Then, click the “+ New Redirect” button at the top right corner.
A pop-up window will show. Here, you can choose the type of redirect, enter the old URL you want to redirect from, and the new URL you want to direct to.
Here are the steps to follow if you’re using Shopify:
Log into your account and click on “Online Store” under “Sales channels.”
Then, select “Navigation.”
From here, go to “View URL Redirects.”
Click the “Create URL redirect” button.
Enter the old URL that you wish to redirect visitors from and the new URL that you want to redirect them to. (Enter “/” to target your store’s home page.)
Finally, save the redirect.
Broken links (links that point to pages that can’t be found) can also be a reason behind 404 errors. So, let’s see how we can quickly identify broken links with the Site Audit tool and fix them.
Fixing Broken Links
A broken link points to a page or resource that doesn’t exist.
Let’s say you’ve been working on a new article and want to add an internal link to your about page at “yoursite.com/about.”
Any typos on your link will create broken links.
So, you’ll get a broken link error if you’ve forgotten the letter “b” and input “yoursite.com/aout” instead of “yoursite.com/about.”
Broken links can be either internal (pointing to another page on your site) or external (pointing to another website).
To find broken links, configure Site Audit if you haven’t yet.
Then, go to the “Issues” tab.
Now, type “internal links” in the search bar at the top of the table to find issues related to broken links.
And click on the blue, clickable text in the issue to see the complete list of affected URLs.
To fix these, change the link, restore the missing page, or add a 301 redirect to another relevant page on your site.
Fixing Robots.txt Errors
Semrush’s Site Audit tool can also help you resolve issues regarding your robots.txt file.
First, set up a project in the tool and run your audit.
Once complete, navigate to the “Issues” tab and search for “robots.txt.”
You’ll now see any issues related to your robots.txt file that you can click on. For example, you might see a “Robots.txt file has format errors” link if it turns out that your file has format errors.
Go ahead and click the blue, clickable text.
And you’ll see a list of invalid lines in the file.
You can click “Why and how to fix it” to get specific instructions on how to fix the error.
Monitor Crawlability to Ensure Success
To make sure your site can be crawled (and indexed and ranked), you should first make it search engine-friendly.
Your pages might not show up in search results if it isn’t. So, you won’t drive any organic traffic.
Finding and fixing problems with crawlability and indexability is easy with the Site Audit tool.
You can even set it up to crawl your site automatically on a recurring basis. To ensure you stay aware of any crawl errors that need to be addressed.