Archive for the ‘crawling and indexing’ Category

Updating our technical Webmaster Guidelines

Webmaster level: All

We recently announced that our indexing system has been rendering web pages more like a typical modern browser, with CSS and JavaScript turned on. Today, we’re updating one of our technical Webmaster Guidelines in light of this announcement.

For optimal rendering and indexing, our new guideline specifies that you should allow Googlebot access to the JavaScript, CSS, and image files that your pages use. Disallowing crawling of JavaScript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content, and can result in suboptimal rankings.
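
For illustration, here is a hypothetical robots.txt snippet (the paths are placeholders) that would prevent Googlebot from fetching the very files it needs to render your pages; removing such rules, or adding explicit Allow rules for your script and stylesheet directories, is what this guideline asks for:

# Hypothetical rules that would hurt rendering-based indexing
User-agent: Googlebot
# Blocks all JavaScript under /js/
Disallow: /js/
# Blocks all stylesheets under /css/
Disallow: /css/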

Updated advice for optimal indexing

Historically, Google indexing systems resembled old text-only browsers, such as Lynx, and that’s what our Webmaster Guidelines said. Now, with indexing based on page rendering, it’s no longer accurate to see our indexing systems as a text-only browser. Instead, a more accurate approximation is a modern web browser. With that new perspective, keep the following in mind:

  • Just like modern browsers, our rendering engine might not support all of the technologies a page uses. Make sure your web design adheres to the principles of progressive enhancement as this helps our systems (and a wider range of browsers) see usable content and basic functionality when certain web design features are not yet supported.
  • Pages that render quickly not only help users get to your content more easily, but also make indexing of those pages more efficient. We advise you to follow the best practices for page performance optimization.
  • Make sure your server can handle the additional load for serving of JavaScript and CSS files to Googlebot.

Testing and troubleshooting

In conjunction with the launch of our rendering-based indexing, we also updated the Fetch and Render as Google feature in Webmaster Tools so webmasters could see how our systems render the page. With it, you’ll be able to identify a number of indexing issues: improper robots.txt restrictions, redirects that Googlebot cannot follow, and more.

And, as always, if you have any comments or questions, please ask in our Webmaster Help forum.

Posted by Pierre Far, Webmaster Trends Analyst

Best practices for XML sitemaps & RSS/Atom feeds

Webmaster level: intermediate-advanced

Submitting sitemaps can be an important part of optimizing websites. Sitemaps enable search engines to discover all pages on a site and to download them quickly when they change. This blog post explains which fields in sitemaps are important, when to use XML sitemaps and RSS/Atom feeds, and how to optimize them for Google.

Sitemaps and feeds

Sitemaps can be in XML sitemap, RSS, or Atom formats. The important difference between these formats is that XML sitemaps describe the whole set of URLs within a site, while RSS/Atom feeds describe recent changes. This has important implications:

  • XML sitemaps are usually large; RSS/Atom feeds are small, containing only the most recent updates to your site.
  • XML sitemaps are downloaded less frequently than RSS/Atom feeds.

For optimal crawling, we recommend using both XML sitemaps and RSS/Atom feeds. XML sitemaps will give Google information about all of the pages on your site. RSS/Atom feeds will provide all updates on your site, helping Google to keep your content fresher in its index. Note that submitting sitemaps or feeds does not guarantee the indexing of those URLs.

Example of an XML sitemap:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
   <loc>http://example.com/mypage</loc>
   <lastmod>2011-06-27T19:34:00+01:00</lastmod>
   <!-- optional additional tags -->
 </url>
 <url>
   ...
 </url>
</urlset>

Example of an RSS feed:

<?xml version="1.0" encoding="utf-8"?>
<rss>
 <channel>
   <!-- other tags -->
   <item>
     <!-- other tags -->
     <link>http://example.com/mypage</link>
     <pubDate>Mon, 27 Jun 2011 19:34:00 +0100</pubDate>
   </item>
   <item>
     ...
   </item>
 </channel>
</rss>

Example of an Atom feed:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 <!-- other tags -->
 <entry>
   <link href="http://example.com/mypage" />
   <updated>2011-06-27T19:34:00+01:00</updated>
   <!-- other tags -->
 </entry>
 <entry>
   ...
 </entry>
</feed>

In these examples, “other tags” refers to both the optional and the required tags defined by the respective standards. We recommend that you specify the required tags for Atom/RSS, as they will help your content appear on other properties that might use these feeds, in addition to Google Search.

Best practices

Important fields

XML sitemaps and RSS/Atom feeds are, at their core, lists of URLs with metadata attached to them. The two most important pieces of information for Google are the URL itself and its last modification time:

URLs

URLs in XML sitemaps and RSS/Atom feeds should adhere to the following guidelines:

  • Only include URLs that can be fetched by Googlebot. A common mistake is including URLs disallowed by robots.txt (which cannot be fetched by Googlebot) or URLs of pages that don’t exist.
  • Only include canonical URLs. A common mistake is to include URLs of duplicate pages. This increases the load on your server without improving indexing.

Last modification time

Specify a last modification time for each URL in an XML sitemap and RSS/Atom feed. The last modification time should be the last time the content of the page changed meaningfully. If a change is meant to be visible in the search results, then the last modification time should be the time of this change.

  • XML sitemaps use <lastmod>
  • RSS uses <pubDate>
  • Atom uses <updated>

Be sure to set or update last modification time correctly:

  • Specify the time in the correct format: W3C Datetime for XML sitemaps, RFC 3339 for Atom, and RFC 822 for RSS (the examples after this list show the same timestamp in each format).
  • Only update the modification time when the content has changed meaningfully.
  • Don’t set the last modification time to the current time whenever the sitemap or feed is served.
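
For reference, here is the timestamp used in the examples above, expressed in each of the three formats:

W3C Datetime (XML sitemaps): 2011-06-27T19:34:00+01:00
RFC 3339 (Atom):             2011-06-27T19:34:00+01:00
RFC 822 (RSS):               Mon, 27 Jun 2011 19:34:00 +0100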

XML sitemaps

XML sitemaps should contain URLs of all pages on your site. They are often large and update infrequently. Follow these guidelines:

  • For a single XML sitemap: update it at least once a day (if your site changes regularly) and ping Google after you update it.
  • For a set of XML sitemaps: maximize the number of URLs in each XML sitemap. The limit is 50,000 URLs or a maximum size of 10MB uncompressed, whichever is reached first. Ping Google for each updated XML sitemap (or once for the sitemap index, if that’s used) every time it is updated. A common mistake is to put only a handful of URLs into each XML sitemap file, which usually makes it harder for Google to download all of them in a reasonable time. An example sitemap index and ping request are shown after this list.
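
As an illustration (example.com and the file names are placeholders), a sitemap index referencing several large sitemaps looks like this:

<?xml version="1.0" encoding="utf-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <sitemap>
   <loc>http://example.com/sitemap-products.xml</loc>
   <lastmod>2011-06-27T19:34:00+01:00</lastmod>
 </sitemap>
 <sitemap>
   <loc>http://example.com/sitemap-articles.xml</loc>
   <lastmod>2011-06-27T19:34:00+01:00</lastmod>
 </sitemap>
</sitemapindex>

To ping Google, send a simple GET request with the (URL-encoded) location of the sitemap or sitemap index, for example:

http://www.google.com/ping?sitemap=http%3A%2F%2Fexample.com%2Fsitemap-index.xml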

RSS/Atom

RSS/Atom feeds should convey recent updates of your site. They are usually small and updated frequently. For these feeds, we recommend:

  • When a new page is added or an existing page meaningfully changed, add the URL and the modification time to the feed.
  • So that Google doesn’t miss updates, the RSS/Atom feed should contain all updates made since at least the last time Google downloaded it. The best way to achieve this is by using PubSubHubbub (see the example hub declaration after this list). The hub will propagate the content of your feed to all interested parties (RSS readers, search engines, etc.) in the fastest and most efficient way possible.
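
For example, in an Atom feed the hub is typically declared with a link element near the top of the feed (the hub URL below is Google’s public PubSubHubbub hub; yours may differ):

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 <!-- other tags -->
 <link rel="hub" href="https://pubsubhubbub.appspot.com/" />
 <link rel="self" href="http://example.com/feed.atom" />
 <!-- entries as in the Atom example above -->
</feed>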

Generating both XML sitemaps and Atom/RSS feeds is a great way to optimize crawling of a site for Google and other search engines. The key information in these files is the canonical URL and the time of the last modification of pages within the website. Setting these properly, and notifying Google and other search engines through sitemap pings and PubSubHubbub, will allow your website to be crawled optimally and represented accordingly in search results.

If you have any questions, feel free to post them here, or to join other webmasters in the webmaster help forum section on sitemaps.

Posted by Alkis Evlogimenos, Google Feeds Team

An improved search box within the search results

Webmaster level: All

Today you’ll see a new and improved sitelinks search box. When shown, it will make it easier for users to reach specific content on your site, directly through your own site-search pages.

What’s this search box and when does it appear for my site?

When users search for a company by name—for example, [Megadodo Publications] or [Dunder Mifflin]—they may actually be looking for something specific on that website. In the past, when our algorithms recognized this, they’d display a larger set of sitelinks and an additional search box below that search result, which let users do site: searches over the site straight from the results, for example [site:example.com hitchhiker guides].

This search box is now more prominent (above the sitelinks), supports Autocomplete, and—if you use the right markup—will send the user directly to your website’s own search pages.

How can I mark up my site?

You need to have a working site-specific search engine for your site. If you already have one, you can let us know by marking up your homepage as a schema.org/WebSite entity with the potentialAction property of the schema.org/SearchAction markup. You can use JSON-LD, microdata, or RDFa to do this; check out the full implementation details on our developer site.
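
As a minimal JSON-LD sketch (example.com and the search URL pattern are placeholders; see the developer site for the authoritative reference), the markup on your homepage might look like this:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "WebSite",
  "url": "http://example.com/",
  "potentialAction": {
    "@type": "SearchAction",
    "target": "http://example.com/search?q={search_term_string}",
    "query-input": "required name=search_term_string"
  }
}
</script>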

If you implement the markup on your site, users will have the ability to jump directly from the sitelinks search box to your site’s search results page. If we don’t find any markup, we’ll show them a Google search results page for the corresponding site: query, as we’ve done until now.

As always, if you have questions, feel free to ask in our Webmaster Help forum.

Posted by Mariya Moeva, Webmaster Trends Analyst, and Kaylin Spitz, Software Engineer

Testing robots.txt files made easier

Webmaster level: intermediate-advanced

To crawl, or not to crawl, that is the robots.txt question.

Making and maintaining correct robots.txt files can sometimes be difficult. While most sites have it easy (tip: they often don’t even need a robots.txt file!), finding the directives within a large robots.txt file that are or were blocking individual URLs can be quite tricky. To make that easier, we’re now announcing an updated robots.txt testing tool in Webmaster Tools.

You can find the updated testing tool in Webmaster Tools, within the Crawl section.

Here you’ll see the current robots.txt file, and can test new URLs to see whether they’re disallowed for crawling. To guide your way through complicated directives, the tool will highlight the specific one that led to the final decision. You can make changes in the file and test those too; you’ll just need to upload the new version of the file to your server afterwards to make the changes take effect. Our developers site has more about robots.txt directives and how the files are processed.
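
For instance, given a hypothetical file like the one below, testing http://example.com/scripts/public/app.js would show that the URL is allowed, because the longer, more specific Allow rule takes precedence over the broader Disallow:

User-agent: *
Disallow: /scripts/
Allow: /scripts/public/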

Additionally, you’ll be able to review older versions of your robots.txt file, and see when access issues block us from crawling. For example, if Googlebot sees a 500 server error for the robots.txt file, we’ll generally pause further crawling of the website.

Since there may be some errors or warnings shown for your existing sites, we recommend double-checking their robots.txt files. You can also combine it with other parts of Webmaster Tools: for example, you might use the updated Fetch as Google tool to render important pages on your website. If any blocked URLs are reported, you can use this robots.txt tester to find the directive that’s blocking them, and, of course, then improve that. A common problem we’ve seen comes from old robots.txt files that block CSS, JavaScript, or mobile content — fixing that is often trivial once you’ve seen it.

We hope this updated tool makes it easier for you to test & maintain the robots.txt file. Should you have any questions, or need help with crafting a good set of directives, feel free to drop by our webmaster’s help forum!

Posted by Asaph Arnon, Webmaster Tools team

Android app indexing is now open for everyone!

Webmaster level: All

Do you have an Android app in addition to your website? You can now connect the two so that users searching from their smartphones and tablets can easily find and reach your app content.

App deep links in search results help your users find your content more easily and re-engage with your app after they’ve installed it. As a site owner, you can show your users the right content at the right time: by connecting pages of your website to the relevant parts of your app, you control when your users are directed to your app and when they go to your website.

Hundreds of apps have already implemented app indexing. This week at Google I/O, we’re announcing a set of new features that will make it even easier to set up deep links in your app, connect your site to your app, and keep track of performance and potential errors.

Getting started is easy

We’ve greatly simplified the process to get your app deep links indexed. If your app supports HTTP deep linking schemes, here’s what you need to do:

  1. Add deep link support to your app (a sketch of an intent filter is shown after this list)
  2. Connect your site and your app
  3. There is no step 3 (:
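
As a rough sketch, HTTP deep link support in an Android app comes down to an intent filter like the one below in your AndroidManifest.xml (the activity name, host and path prefix are placeholders):

<activity android:name=".ArticleActivity">
 <intent-filter>
   <action android:name="android.intent.action.VIEW" />
   <category android:name="android.intent.category.DEFAULT" />
   <category android:name="android.intent.category.BROWSABLE" />
   <!-- Handles URLs such as http://example.com/articles/... -->
   <data android:scheme="http"
         android:host="example.com"
         android:pathPrefix="/articles" />
 </intent-filter>
</activity>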

As we index your URLs, we’ll discover and index the app / site connections and may begin to surface app deep links in search results.

We can discover and index your app deep links on our own, but we recommend you publish the deep links. This is also the case if your app only supports a custom deep link scheme. You can publish them in one of the following ways:

  • In your sitemap, by annotating each relevant URL with its app deep link.
  • On your web pages, with a link rel="alternate" element pointing to the corresponding app deep link.
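
As a sketch (the package name and URLs are placeholders; the developer guidelines have the authoritative format), a deep link published in a sitemap looks like this:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
 <url>
   <loc>http://example.com/mypage</loc>
   <xhtml:link rel="alternate"
               href="android-app://com.example.android/http/example.com/mypage" />
 </url>
</urlset>

The equivalent on-page annotation is a single link element in the head of the corresponding web page:

<link rel="alternate" href="android-app://com.example.android/http/example.com/mypage" />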

There’s one more thing: we’ve added a new feature in Webmaster Tools to help you debug any issues that might arise during app indexing. It will show you what types of errors we’ve detected for the app page / web page pairs, together with example app URIs so you can debug them.

We’ll also give you detailed instructions on how to debug each issue, including a QR code for the app deep links, so you can easily open them on your phone or tablet. We’ll send you Webmaster Tools error notifications as well, so you can keep up to date.

Give app indexing a spin, and as always, if you need more help ask questions on the Webmaster help forum.

Posted by Mariya Moeva, Webmaster Trends Analyst

Directing smartphone users to the page they actually wanted

Webmaster level: all

Have you ever used Google Search on your smartphone and clicked on a promising-looking result, only to end up on the mobile site’s homepage, with no idea why the page you were hoping to see vanished? This is such a common annoyance that we’ve even seen comics about it. Usually this happens because the website is not properly set up to handle requests from smartphones and sends you to its smartphone homepage—we call this a “faulty redirect”.

We’d like to spare users the frustration of landing on irrelevant pages and help webmasters fix the faulty redirects. Starting today in our English search results in the US, whenever we detect that smartphone users are redirected to a homepage instead of the page they asked for, we may note it below the result. If you still wish to proceed to the page, you can click “Try anyway”.

And we’re providing advice and resources to help you direct your audience to the pages they want. Here’s a quick rundown:

1. Do a few searches on your own phone (or with a browser set up to act like a smartphone) and see how your site behaves. Simple but effective. 🙂

2. Check out Webmaster Tools—we’ll send you a message if we detect that any of your site’s pages are redirecting smartphone users to the homepage. We’ll also show you any faulty redirects we detect in the Smartphone Crawl Errors section of Webmaster Tools.

3. Investigate any faulty redirects and fix them. Here’s what you can do:

  • Use the example URLs we provide in Webmaster Tools as a starting point to debug exactly where the problem is with your server configuration.
  • Set up your server so that it redirects smartphone users to the equivalent URL on your smartphone site (a configuration sketch follows this list).
  • If a page on your site doesn’t have a smartphone equivalent, keep users on the desktop page, rather than redirecting them to the smartphone site’s homepage. Doing nothing is better than doing something wrong in this case.
  • Try using responsive web design, which serves the same content for desktop and smartphone users.
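
As a rough sketch only (the user-agent pattern and URL structure are placeholders; adapt them to your own setup), an Apache mod_rewrite rule that redirects smartphone users to the equivalent page, rather than to the homepage, could look like this:

# .htaccess context; patterns differ slightly in a vhost config
RewriteEngine On
# Hypothetical pattern for smartphone user agents
RewriteCond %{HTTP_USER_AGENT} (Android.*Mobile|iPhone) [NC]
# Send users to the equivalent page on the m. site, preserving the path
RewriteRule ^articles/(.*)$ http://m.example.com/articles/$1 [R=302,L]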

If you’d like to know more about building smartphone-friendly sites, read our full recommendations. And, as always, if you need more help you can ask a question in our webmaster forum.

Posted by , Webmaster Trends Analyst

Rendering pages with Fetch as Google

Webmaster level: all

The Fetch as Google feature in Webmaster Tools provides webmasters with the results of Googlebot attempting to fetch their pages. The server headers and HTML shown are useful to diagnose technical problems and hacking side-effects, but sometimes make double-checking the response hard: Help! What do all of these codes mean? Is this really the same page as I see it in my browser? Where shall we have lunch? We can’t help with that last one, but for the rest, we’ve recently expanded this tool to also show how Googlebot would be able to render the page.

Viewing the rendered page

In order to render the page, Googlebot will try to find all the external files involved, and fetch them as well. Those files frequently include images, CSS and JavaScript files, as well as other files that might be indirectly embedded through the CSS or JavaScript. These are then used to render a preview image that shows Googlebot’s view of the page.

You can find the Fetch as Google feature in the Crawl section of Google Webmaster Tools. After submitting a URL with “Fetch and render,” wait for it to be processed (this might take a moment for some pages). Once it’s ready, just click on the response row to see the results.


Handling resources blocked by robots.txt

Googlebot follows the robots.txt directives for all files that it fetches. If you are disallowing crawling of some of these files (or if they are embedded from a third-party server that’s disallowing Googlebot’s crawling of them), we won’t be able to show them to you in the rendered view. Similarly, if the server fails to respond or returns errors, then we won’t be able to use those either (you can find similar issues in the Crawl Errors section of Webmaster Tools). If we run across either of these issues, we’ll show them below the preview image.

We recommend making sure Googlebot can access any embedded resource that meaningfully contributes to your site’s visible content, or to its layout. That will make Fetch as Google easier for you to use, and will make it possible for Googlebot to find and index that content as well. Some types of content – such as social media buttons, fonts or website-analytics scripts – tend not to meaningfully contribute to the visible content or layout, and can be left disallowed from crawling. For more information, please see our previous blog post on how Google is working to understand the web better.

We hope this update makes it easier for you to diagnose these kinds of issues, and to discover content that’s accidentally blocked from crawling. If you have any comments or questions, let us know here or drop by in the webmaster help forum.

Posted by Shimi Salant, Webmaster Tools team

Understanding web pages better


In 1998, when our servers were running in Susan Wojcicki’s garage, we didn’t really have to worry about JavaScript or CSS. They weren’t used much; when JavaScript was used at all, it was to make page elements… blink! A lot has changed since then. The web is full of rich, dynamic, amazing websites that make heavy use of JavaScript. Today, we’ll talk about our capability to render richer websites, meaning we see your content more like modern web browsers do: fetching the external resources, executing JavaScript and applying CSS.

Traditionally, we were only looking at the raw textual content that we’d get in the HTTP response body and didn’t really interpret what a typical browser running JavaScript would see. When pages that have valuable content rendered by JavaScript started showing up, we weren’t able to let searchers know about it, which is a sad outcome for both searchers and webmasters.

In order to solve this problem, we decided to try to understand pages by executing JavaScript. It’s hard to do that at the scale of the current web, but we decided that it’s worth it. We have been gradually improving how we do this for some time. In the past few months, our indexing system has been rendering a substantial number of web pages more like an average user’s browser with JavaScript turned on.

Sometimes things don’t go perfectly during rendering, which may negatively impact search results for your site. Here are a few potential issues, and, where possible, how you can help prevent them from occurring:

  • If resources like JavaScript or CSS in separate files are blocked (say, with robots.txt) so that Googlebot can’t retrieve them, our indexing systems won’t be able to see your site like an average user. We recommend allowing Googlebot to retrieve JavaScript and CSS so that your content can be indexed better. This is especially important for mobile websites, where external resources like CSS and JavaScript help our algorithms understand that the pages are optimized for mobile.
  • If your web server is unable to handle the volume of crawl requests for resources, it may have a negative impact on our capability to render your pages. If you’d like to ensure that your pages can be rendered by Google, make sure your servers are able to handle crawl requests for resources.
  • It’s always a good idea to have your site degrade gracefully. This will help users enjoy your content even if their browser doesn’t have a compatible JavaScript implementation. It will also help visitors who have JavaScript turned off, as well as search engines that can’t execute JavaScript yet (a minimal sketch follows this list).
  • Sometimes the JavaScript may be too complex or arcane for us to execute, in which case we can’t render the page fully and accurately.
  • Some JavaScript removes content from the page rather than adding it, which prevents us from indexing that content.
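
As a minimal sketch of graceful degradation (the content and ids are placeholders): the critical content is served in the HTML itself, and JavaScript only enhances it rather than creating or removing it.

<div id="article">
 <!-- Critical content is present in the HTML response -->
 <h1>Product specifications</h1>
 <p>Weight: 1.2 kg. Battery life: 10 hours.</p>
</div>
<script>
 // Enhancement only: if this script fails or never runs, the content
 // above is still visible to users and available to search engines.
 document.getElementById('article').classList.add('collapsible');
</script>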


To make things easier to debug, we’re currently working on a tool to help webmasters better understand how Google renders their site. We look forward to making it available to you in the coming days in Webmaster Tools.

If you have any questions, please feel free to visit our help forum.

Posted by Erik Hendriks and Michael Xu, Software Engineers, and Kazushi Nagayama, Webmaster Trends Analyst

Creating the Right Homepage for your International Users


If you are doing business in more than one country or targeting different languages, we recommend having separate sites or sections with specific content on separate URLs, each targeted at an individual country or language. For instance, one page for US, English-speaking visitors, and a different page for France, French-speaking users. While we have information on handling multi-regional and multilingual sites, the homepage can be a bit special. This post will help you create the right homepage on your website to serve the appropriate content to users depending on their language and location.

There are three ways to configure your homepage / landing page when your users access it:

  • Show everyone the same content.
  • Let users choose.
  • Serve content depending on users’ location and language settings.

Let’s have a look at each in detail.

Show users worldwide the same content 

In this scenario, you decide to serve specific content for one given country and language on your homepage / generic URL (http://www.example.com). This content will be available to anyone who accesses that URL directly in their browser or searches for it specifically. As mentioned above, all country & language versions should also be accessible on their own unique URLs.

Note: You can show a banner on your page to suggest a more appropriate version to users from other locations or with different language settings.

Let users choose which local version and which language they want 

In this configuration, you decide to serve a country selector page on your homepage / generic URL and to let users choose which content they want to see depending on country and language. All users who type in that URL can access the same page.

If you implement this scenario on your international site, remember to use the x-default rel-alternate-hreflang annotation for the country selector page, which was specifically created for these kinds of pages. The x-default value helps us recognize pages that are not specific to one language or region.

Automatically redirect users or dynamically serve the appropriate HTML content depending on users’ location and language settings

A third scenario would be to automatically serve the appropriate HTML content to your users depending on their location and language settings. You can do that either by using server-side 302 redirects or by dynamically serving the right HTML content.

Remember to use the x-default rel-alternate-hreflang annotation on the homepage / generic page, even if the latter is a redirect page that is not directly accessible to users.

Note: Think about redirecting users for whom you do not have a specific version; for instance, French-speaking users on a website that has English, Spanish and Chinese versions. Show them the content that you consider the most appropriate.
Whatever configuration you decide to go with, you should make sure all the pages – including country and language selector pages:

  • Have rel-alternate-hreflang annotations.
  • Are accessible for Googlebot’s crawling and indexing: do not block the crawling or indexing of your localized pages.
  • Always allow users to switch the local version or language; you can do that using a drop-down menu, for instance.

Reminder: As mentioned at the beginning, you must have separate URLs for each country and language version.

About rel-alternate-hreflang annotations

Remember to annotate all your pages, whatever method you choose. This will greatly help search engines show the right results to your users.

Country selector pages and redirecting or dynamically serving homepages should all use the x-default hreflang value, which was specifically designed for auto-redirecting homepages and country selectors.

Finally, here are a few useful reminders about rel-alternate-hreflang annotations in general (a combined example follows the list):
  • Your annotations must be confirmed by the other pages. If page A links to page B, page B must link back to page A; otherwise, your annotations may not be interpreted correctly.
  • Your annotations should be self-referential: page A should also include a rel-alternate-hreflang annotation linking to itself.
  • You can specify the rel-alternate-hreflang annotations in the HTTP header, in the head section of the HTML, or in a sitemap file. We strongly recommend that you choose only one way to implement the annotations, in order to avoid inconsistent signals and errors.
  • The value of the hreflang attribute must be in ISO 639-1 format for the language, and in ISO 3166-1 Alpha 2 format for the region. Specifying only the region is not supported. If you wish to configure your site only for a country, use the geotargeting feature in Webmaster Tools.
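
As a combined sketch (the URLs are placeholders), every version of a page, plus the country selector or auto-redirecting homepage, would carry the same set of annotations in its head section:

<link rel="alternate" hreflang="en-us" href="http://www.example.com/en-us/" />
<link rel="alternate" hreflang="fr-fr" href="http://www.example.com/fr-fr/" />
<link rel="alternate" hreflang="x-default" href="http://www.example.com/" />
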
Following these recommendations will help us better understand your localized content and serve more relevant results to your users in our search results. As always, if you have any questions or feedback, please tell us in the internationalization Webmaster Help Forum.

Posted by Zineb Ait Bahajji and Gary Illyes, Webmaster Trends Analysts.

App Indexing updates

Webmaster Level: Advanced

In October, we announced guidelines for App Indexing, which enables deep linking directly from Google Search results to your Android app. Thanks to all of you who have expressed interest. We’ve just enabled 20+ additional applications whose app deep links users will soon see in Search Results, and starting today we’re making app deep links to English content available globally.

We’re continuing to onboard more publishers in all languages. If you haven’t added deep link support to your Android app or specified these links on your website or in your Sitemaps, please do so and then notify us by filling out this form.

Here are some best practices to consider when adding deep links to your sitemap or website:

  • App deep links should only be included for canonical web URLs.
  • Remember to specify an app deep link for your homepage.
  • Not all website URLs in a Sitemap need to have a corresponding app deep link. Do not include app deep links that aren’t supported by your app.
  • If you are a news site and use News Sitemaps, be sure to include your deep link annotations in the News Sitemaps, as well as your general Sitemaps.
  • Don’t provide annotations for deep links that execute native ARM code. This enables app indexing to work for all platforms.

When Google indexes content from your app, your app will need to make HTTP requests that it usually makes under normal operation. These requests will appear to your servers as originating from Googlebot. Therefore, your server’s robots.txt file must be configured properly to allow these requests.

Finally, please make sure the back button behavior of your app leads directly back to the search results page.

For more details on implementation, visit our updated developer guidelines. And, as always, you can ask questions on the mobile section of our webmaster forum.

Posted by Michael Xu, Software Engineer