
A Note to Our Community On How To Hide Your Content From Search Engines

Say your organization has done something pretty terrible. Terrible enough that you really didn’t want to acknowledge it initially but eventually blogged about it, and you haven’t added a blog post in a long time, so that entry sits at the top of your blog index page, which Google can and will still index since it’s been linked to from this site, which has a high internal rating in their massive database.

If you wanted to help ensure nobody finds that original page, there are lots of ways to do that.

First, you could add a Disallow entry for it in your robots.txt. Ironically, some organizations don’t go that route but do try to prevent Google (et al) from indexing their terms of use and privacy policy, which might suggest they don’t want a historical record that folks could compare changes against, and perhaps are even planning changes (it might be good if more than just me saves off some copies of those now).
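
Purely as a sketch (the path here is made up for illustration), such an entry would look something like this:

User-Agent: *
Disallow: /blog/that-post-wed-rather-forget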

Now, robots.txt modifications are fairly straightforward. And, they are also super easy to check.
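
Checking takes a single line of R. Google’s own robots.txt is just a handy public example here; substitute whatever site you actually want to watch:

# robots.txt is plain text, so a plain readLines() is enough
head(readLines("https://www.google.com/robots.txt"), 10)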

So, what if you wanted to hide your offense from Google (et al) and not make it obvious in your robots.txt? For that, you can use a special <meta> tag in the <head> of the page.

This is an example of what that looks like:

[screenshot: the <meta> tag in the DataCamp page source]

but that may be hard to see, so let’s look at it up close:

<meta name="robots" content="noindex" class="next-head" /><title class="next-head">A note to our community (article) - DataCamp</title><link rel="canonical" href="https://www.datacamp.com/community/blog/note-to-our-community" class="next-head" /><meta property="og:url" content="https://www.datacamp.com/community/blog/note-to-our-community" class="next-head" />

That initial <meta> tag will generally be respected by all search engines.

And, if you want to be really sneaky, you can have your web server add a special X-Robots-Tag: noindex HTTP response header for any page you want to have no permanent record of, and sneak past even more eyes.
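
As a rough sketch of what that looks like (the path is hypothetical, and the exact config depends on how the site is actually served), it’s a one-liner in nginx or in Apache with mod_headers:

# nginx
location = /blog/that-post-wed-rather-forget {
    add_header X-Robots-Tag "noindex";
}

# Apache (mod_headers)
<Location "/blog/that-post-wed-rather-forget">
    Header set X-Robots-Tag "noindex"
</Location>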

Unfortunately, some absolute novices who did know how to do the <meta> tag trick aren’t bright enough to do the sneakier version and get caught. Here’s an example of a site that doesn’t use the super stealthy header approach:

[screenshot: the DataCamp page, sans the stealthy header]

FIN

So, if you’re going to be childish and evil, now you know what you really should do to try to keep things out of public view.

Also, if you’re one of the folks who likes to see justice be done, you now know where to check and can use this R snippet to do so whenever you like. Just substitute the randomly chosen site/page below for one that you want to monitor.

library(httr)
library(xml2)
library(magrittr) # for %>%, which neither httr nor xml2 attaches

httr::GET(
  url = "https://www.datacamp.com/community/blog/note-to-our-community"
) -> res

# gather the response headers into a data frame
data.frame(
  name = names(res$all_headers[[1]]$headers), # if there are more than one set (i.e. redirects) you'll need to iterate
  value = unlist(res$all_headers[[1]]$headers, use.names = FALSE)
) -> hdrs

# any X-Robots-Tag (or "robots") header?
hdrs[tolower(hdrs[["name"]]) %in% c("robots", "x-robots-tag"), ]
## [1] name  value
## <0 rows> (or 0-length row.names)

# any robots <meta> tag in the page itself?
httr::content(res) %>%
  xml_find_all(".//meta[@name='robots']")
## {xml_nodeset (1)}
## [1] <meta name="robots" content="noindex" class="next-head">\n

# and, the robots.txt file
readLines("https://www.datacamp.com/robots.txt")
## [1] "User-Agent: *"
## [2] "Disallow: /users/auth/linkedin/callback"
## [3] "Disallow: /terms-of-use"
## [4] "Disallow: /privacy-policy"
## [5] "Disallow: /create/how"
## [6] "Sitemap: http://assets.datacamp.com/sitemaps/main/production/sitemap.xml.gz"

Thank you for reading to the end of this note to our community.


*** This is a Security Bloggers Network syndicated blog from rud.is authored by hrbrmstr. Read the original post at: https://rud.is/b/2019/04/12/a-note-to-our-community-on-how-to-hide-your-content-from-search-engines/