Google SiteMaps and You
Last week,
we looked at the recent news that Microsoft had decided to embrace
RSS in a big way in its upcoming releases of Internet Explorer and
Windows "Longhorn" and determined that this was a Good Thing. This week,
we're taking a look at implementing Google Sitemaps, a similar
technology developed by Google in order to help you define your site
more effectively to the search-engine behemoth. This is not a ticket to
a higher Google ranking (at least not that we know about); but it is a
useful tool that lets you apply RSS-like control to your website's
interactions with the Googlebot.
RSS (Really Simple Syndication) is the current heavyweight of so-called
"disruptive technologies" (loosely defined as those that have the
effect, if not developed with the intention, of changing the way we use
technology in general) and its use is skyrocketing among content
providers looking for a way to get their content in front of more eyes
and ears. But RSS originally stood for Rich Site Summary, a standard way
of cataloging your site's content for third-party aggregators.
Google Sitemaps have a similar function, in that they are an XML-based
way to describe website content in a standard, predictable way; but they
differ in that Sitemaps are intended for the Googlebot's eyes only,
rather than for any third party. Think of them as an automated way to
make sure Google knows about your site's content (please note, however,
that Google does not guarantee inclusion of your content based solely on
the presence of a Sitemap file).
This sounds like a very specific undertaking, but the importance of
Google to getting your site's content noticed simply cannot be
overstated. And with Google's expanding reach into more and more areas
of Web content presentation, chances are that the information your
Sitemap provides will eventually find some use you haven't yet thought
about. That's what disruptive technology is all about, and Google has
become one of the more innovative champions of such technological
advances.
Where To Start
The first thing you should do as a website developer is create a Google
Account for yourself or your company. This will allow you to do other
things besides access the Sitemaps infrastructure; but we'll leave that
for another day. Create the
account here and then proceed to the Sitemaps area at
this link. Once you've logged in, you'll see the sparse Sitemaps
interface. Don't be fooled, however, because like the simple interface
to its search engine, this one hides quite a bit of information
regarding the creation and use of Sitemaps, presenting it in digestible
bites as you walk through the process.
There's probably more there than you need to know at this point,
provided you don't have a huge site with a need for multiple Sitemaps
and so on. But if you do have such a site, the information is there for
creating truly complex Sitemaps and Sitemap Indices referencing many
Sitemaps, and you can familiarize yourself with it as needed. For now,
we'll concentrate on what's required to establish a Sitemap for our site
at Cafe ID.
Like creating RSS feeds, creating a Google Sitemap is as simple as
putting together an XML file at the root level of your site that
describes the site according to the instructions that Google has laid
out. You can use any text editor for this purpose, but some editors do a
better job of helping you create properly formatted XML files. We
heartily recommend two that cost money,
BBEdit on Mac OS
X and Macromedia's
Homesite on Windows, but there are excellent free alternatives out
there and when it comes to text editors, personal preferences take on an
almost
religious importance, so we won't proselytize about that here.
The Googlebot recognizes several Sitemap formats, ranging from a simple
list of URLs to Sitemaps already created using something called the
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), a
format apparently popular with library collections. The OAI protocol is
an advanced XML specification that you don't need to worry about if you
don't already understand it. An intermediate XML format is what we
recommend, over the simple URL list, because of the additional
information you can associate with each constituent URL of your site.
If you do want to just get started quickly, simply create a text file
that looks like this:
http://www.example.com/catalog?item=1
http://www.example.com/catalog?item=11 ...
making sure that each URL sits on its own line with no embedded newline
characters and that the file uses the UTF-8 text encoding (check your
text editor settings). Also, your Sitemap may not contain more than
50,000 URLs, and all URLs must be fully formed, since they will be used
directly during the Googlebot's crawl.
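As a rough illustration of those rules, here is a minimal Python sketch that writes a URL-list Sitemap. The function name write_url_list is hypothetical, and the "fully formed" check (an http or https scheme plus a host name) is our own assumption about what that requirement means in practice.

```python
from urllib.parse import urlparse

MAX_URLS = 50000  # per-Sitemap limit mentioned above


def write_url_list(urls, path):
    """Write one fully formed URL per line to a UTF-8 text file."""
    urls = list(urls)
    if len(urls) > MAX_URLS:
        raise ValueError("a Sitemap may not contain more than 50,000 URLs")
    for url in urls:
        # A URL must occupy exactly one line, so reject embedded line breaks.
        if "\n" in url or "\r" in url:
            raise ValueError("URL contains a line break: %r" % url)
        parts = urlparse(url)
        # Assumption: "fully formed" means an http(s) scheme and a host name.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError("not a fully formed URL: %r" % url)
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for url in urls:
            handle.write(url + "\n")
```

Calling write_url_list(["http://www.example.com/catalog?item=1"], "sitemap.txt") would produce a one-line file; a relative URL such as "/catalog?item=1" raises ValueError instead of being written.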
Getting Fancy
The more advanced format isn't much more difficult to create and lets
you specify additional information about each URL. The protocol is
described fully in Google's documentation and is too detailed to cover
here. Your finished file will look something like this, except
(hopefully) with more URLs specified:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.cafeid.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.cafeid.com/art-over.shtml</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
Your Sitemap's location dictates what URLs can be included in it. A
Sitemap placed at the root level of your site can specify any URLs on
that site, while a Sitemap placed at www.yoursite.com/images cannot
include URLs under www.yoursite.com/banners, for example.
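The location rule can be checked mechanically before you submit anything. The sketch below is one way to do it in Python; the function name in_sitemap_scope is hypothetical, and treating "same scheme and host, path at or below the Sitemap's directory" as the rule is our reading of the paragraph above rather than Google's exact wording.

```python
import posixpath
from urllib.parse import urlparse


def in_sitemap_scope(sitemap_url, page_url):
    """Return True if page_url sits at or below the Sitemap's directory."""
    sitemap = urlparse(sitemap_url)
    page = urlparse(page_url)
    # Assumption: the page must share the Sitemap's scheme and host.
    if (sitemap.scheme, sitemap.netloc) != (page.scheme, page.netloc):
        return False
    # Directory that holds the Sitemap file, always ending in "/".
    base_dir = posixpath.dirname(sitemap.path).rstrip("/") + "/"
    # Treat a bare host name (empty path) as the site root.
    page_path = page.path or "/"
    return page_path.startswith(base_dir)
```

With a Sitemap at http://www.yoursite.com/images/sitemap.xml, a URL under /images/ is in scope and one under /banners/ is not, while a Sitemap at the site root accepts both.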
You can take as much or as little advantage as you like of the
additional XML tags available in this format. Each <url> needs to
include at least the <loc> specification, but need not include the
other three, and all URLs in a Sitemap file must be enclosed within
the <urlset> tag. We recommend using at least the <lastmod> tag and the
<changefreq> tag to give the Googlebot a hint about how often it should
check your site for updated content. Be sure to change the date, and
maybe even the time, specified in the <lastmod> tag any time you
actually update your site.
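If you would rather generate the file than type it, something like the following Python sketch would do. It uses the standard library's xml.etree.ElementTree; the function name build_sitemap and the dictionary layout of each entry are our own invention for illustration, and the namespace is the one from the sample file above.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example Sitemap above.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"
OPTIONAL_TAGS = ("lastmod", "changefreq", "priority")


def build_sitemap(entries):
    """Return Sitemap XML as bytes for a list of dicts describing URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # <loc> is the only required child of <url>.
        ET.SubElement(url, "loc").text = entry["loc"]
        # The other three tags are written only when supplied.
        for tag in OPTIONAL_TAGS:
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    body = ET.tostring(urlset, encoding="utf-8")
    return b'<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

For example, build_sitemap([{"loc": "http://www.cafeid.com/", "lastmod": "2005-01-01", "changefreq": "monthly", "priority": 0.8}]) returns bytes you can write straight to a sitemap.xml file. ElementTree escapes characters such as "&" in the text it writes, so pass it raw URLs rather than ones you have already XML-encoded.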
One more caveat is that your URL specifications must be XML-encoded,
similar to the way they're encoded under RSS. What this means is
spelled out in detail in Google's documentation, but essentially, what
you're doing is converting a URL like
http://www.yoursite.com/view?widget=3&count>2
to look like this:
http://www.yoursite.com/view?widget=3&amp;count&gt;2
(Note the substitution of the entities &amp; and &gt; for the "&" and
">" symbols.)
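You don't have to do this substitution by hand. As a small sketch, Python's standard xml.sax.saxutils.escape function performs it; the wrapper name xml_encode_url is hypothetical, and the extra handling of quote characters is our own precaution, not something the example above requires.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quote characters are added here.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def xml_encode_url(url):
    """Return url with XML-special characters replaced by entities."""
    return escape(url, QUOTE_ENTITIES)


print(xml_encode_url("http://www.yoursite.com/view?widget=3&count>2"))
# prints: http://www.yoursite.com/view?widget=3&amp;count&gt;2
```

Run it only on raw URLs: encoding a URL that is already encoded would turn "&amp;" into "&amp;amp;".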
Done. Now What Do I Do With It?
You're almost home. Upload the Sitemap file you create to your server
and then submit the file's URL through your Google Sitemaps account.
You don't need to use the account, but doing so will allow you
to keep track of what you've submitted. You're welcome to compress your
Sitemap file using gzip, found typically on Mac OS X, Linux and BSD
(normal PC zipping won't work, although you can certainly find a
third-party gzip program for your Windows box). Click the "Add Your
First Sitemap" link on the main Sitemaps page after you've logged into
your Google Sitemaps account, and that's all there is to it!
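If you have Python installed, including on Windows, its standard gzip module is one way to produce the compressed file without hunting for a separate program. This is only a sketch; the function name gzip_file and the ".gz" naming convention are our assumptions.

```python
import gzip
import shutil


def gzip_file(src_path, dest_path=None):
    """Compress src_path with gzip and return the path of the new file."""
    # Assumption: the compressed copy sits next to the original as "<name>.gz".
    dest_path = dest_path or src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path
```

Calling gzip_file("sitemap.xml") would leave the original in place and write sitemap.xml.gz beside it, ready to upload.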
You can use your Sitemaps account to keep track of and receive
diagnostic information about your Sitemap submissions. A Sitemaps
account is optional, however, and if you already have a Google
account for receiving Alerts, for accessing the Web Developer APIs and
so on, your existing account will work as a Sitemaps account
automatically.
Google has already played a significant role in shifting the paradigm of
discovering the Web from doing so by following links to doing so by
searching, and the company shows no signs of slowing down. Subscribing
may well be the next paradigm, based on the flexibility of the protocols
that put content syndication in the hands of mere mortals, and getting
your content cataloged in these formats should be among your first
priorities. Web browsers and operating systems are adjusting quickly to
this new paradigm, and you should be too.