How to Disallow Google from Indexing Your HubSpot Gated Documents

Here’s a common scenario: you create a HubSpot landing page with a form. Once people fill in that form, they are redirected to a gated document (possibly an e-book or a whitepaper). The system is simple and works flawlessly; you are happy, and your client is happy! A few weeks later, your client emails you to say that their PDF has been indexed by Google and is now accessible to anyone, without going through the HubSpot gate.

You do some research and discover that you should have set the “File URL Visibility” to “Public – No-index” by doing the following:

  • Going to “Marketing” -> “Files and Templates” -> “Files” in HubSpot
  • Clicking on the name of the file affected, e.g. client-file.pdf
  • Setting the “File URL Visibility” to “Public – No-index” in the “File details” side window

So you do that, and you think (and rightly so) that it’s probably too late for this particular file since it’s already indexed by Google, and you apologize to the client, and promise that “this won’t happen in the future”.

Another couple of weeks later, the same client asks you to host another gated asset (e.g. yet another whitepaper), so you gladly do it, and you ensure that its visibility is set to “Public – No-index”. You assure the client that it won’t get indexed by Google because you “ensured it won’t”. Yet another couple of weeks later, you get an email from the client complaining that the document is indexed by Google and that people are able to access it directly. How could that happen when you ensured that its visibility was set to “Public – No-index”?

You start thinking that you are missing something: maybe the problem is that Google is allowed to index those landing pages? You talk yourself into believing this twisted theory, and so you disallow Google from indexing the landing pages altogether (after all, 99% of the traffic to those landing pages comes from direct email marketing). You do that by going to the “Settings” page in HubSpot, clicking on “Website” -> “Pages” in the left navigation bar, clicking on the “SEO & Crawlers” tab, and finally adding the following to the robots.txt file:

User-agent: *
Disallow: /
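
If you want to see what that file actually tells a crawler, here is a minimal sketch using Python’s standard urllib.robotparser module. The example.com URL and the is_fetch_allowed helper are made up for illustration; only the two robots.txt lines come from the setup above. Note that the answer is about fetching pages, which is all that robots.txt can express.

```python
from urllib.robotparser import RobotFileParser

# The same two lines that were added in HubSpot's "SEO & Crawlers" tab.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_fetch_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL.

    Hypothetical helper for illustration; it parses the rules locally
    and never touches the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com stands in for your own landing page domain.
    allowed = is_fetch_allowed(
        ROBOTS_TXT, "Googlebot", "https://example.com/landing-page"
    )
    print("Googlebot may fetch the page:", allowed)
```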

You feel considerably less confident this time, but you still inform the client that “you missed something” last time, and that the problem will not happen again. The client is quickly convinced (because of all those years of building that precious trust with you), and asks you to host another piece of gated content, which you do, and you ensure that the file’s visibility is set to “Public – No-index” and that Google isn’t even allowed to index any landing page.

Five days later, your left (or right, or both) eye starts twitching the moment you hear the phone ring and see the client’s number; it’s not time for another piece of gated content, so there must be something wrong. You answer the phone and your client suddenly has a completely different tone. You apologize for nearly 15 minutes and you say “Do not worry! I’m on it”, just before hanging up.

You start biting your nails/lips and you start apologizing to every single person you’ve hurt intentionally or unintentionally since 1st grade. You then go for a walk and are suddenly gentle and kind to everyone you see – maybe that would lift the curse?

But it’s not a curse. In fact, the problem comes from both HubSpot and Google. robots.txt directives only ask crawlers not to fetch a URL; they do not stop Google from listing a URL it discovers through a link somewhere else, so Google ends up indexing the file anyway, and HubSpot is gullible enough to believe that those directives will be enough.

So what’s the real solution here?

Well, the solution is to disallow Google (and other search engines) from indexing the assets, even if they wanted to, and that can only be done if the asset is hosted on a server you fully control, because it must be done at the .htaccess level.

In other words, you will need to go into one of your servers, create a directory, upload your assets (and ultimately link to those assets from your HubSpot landing pages), and then create an .htaccess file with the following rules:

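# Answer the crawlers listed below with 403 Forbidden for everything in this directory.
RewriteEngine On
# [NC] makes the User-Agent match case-insensitive; [OR] chains the conditions together.
RewriteCond %{HTTP_USER_AGENT} googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bingbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} alexa [NC]
# [F] sends the 403 response for any requested path.
RewriteRule .* - [F,L]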
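
Before uploading the file, it can help to sanity-check which visitors those conditions would catch. The following Python sketch mimics the three RewriteCond patterns with a case-insensitive regular expression. It is an approximation for illustration, not a replacement for testing against your actual server, and the would_be_forbidden helper is an assumption of this sketch rather than anything Apache provides.

```python
import re

# Substrings taken from the three RewriteCond lines above.
BLOCKED_PATTERNS = ("googlebot", "bingbot", "alexa")

# re.IGNORECASE plays the role of the [NC] flag; "|" plays the role of [OR].
_BLOCKED_RE = re.compile("|".join(BLOCKED_PATTERNS), re.IGNORECASE)


def would_be_forbidden(user_agent: str) -> bool:
    """Return True if the User-Agent would match one of the conditions.

    Hypothetical helper: it only approximates the matching logic and
    does not send any request.
    """
    return _BLOCKED_RE.search(user_agent) is not None


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    ]
    for ua in samples:
        verdict = "403 Forbidden" if would_be_forbidden(ua) else "served"
        print(f"{verdict}: {ua}")
```

Keep in mind that this kind of rule is only as reliable as the User-Agent header the visitor sends.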

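The .htaccess rules above answer the listed crawlers with a 403 Forbidden response, which should prevent Google from even fetching, let alone indexing, those PDFs. Adding a robots.txt file at the root of the host that serves those assets (crawlers only look for robots.txt at the root of a host, so a dedicated subdomain is the easiest setup) with the following code…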

User-agent: *
Disallow: /

…should also help (it asks other well-behaved bots not to crawl the assets).

So there you go, the right way of hosting gated assets! If you need help with deploying the above, or any other HubSpot help, then please contact us. Our fees are super affordable, we are ready to help, and we have solid and proven HubSpot experience.
