Sitemaps

Pandosearch can parse sitemaps. We use sitemap information to discover new content and update existing content.

Sitemap parsing is enabled by default, but can be disabled if needed.

File format

Pandosearch can parse sitemaps in the Sitemaps XML format. Sites widely use this format to tell search engines which pages exist and where to find them.

We can parse two kinds of files: urlset and sitemapindex.

Format: urlset

This is the basic sitemap format for listing URLs present on a website.

Example file:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://www.example.com/about</loc>
      <lastmod>2006-01-01T09:09:09+00:00</lastmod>
      <priority>0.5</priority>
   </url>
</urlset>

The urlset is the root element. For each url, we use the loc and lastmod values:

  • loc – The URL of the page to retrieve.
  • lastmod – We only retrieve pages for which lastmod has changed since the previous crawl. This can be disabled for systems that do not reliably update the lastmod property (e.g. in auto-generated sitemaps).

Any other fields specified are ignored.
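
As an illustration of the rules above, here is a minimal Python sketch of reading a urlset and selecting the pages to retrieve. It is not Pandosearch's actual implementation: the function names parse_urlset and urls_to_fetch are hypothetical, and treating an entry without lastmod as always due for retrieval is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Sitemaps XML format.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_urlset(xml_bytes):
    """Return a dict mapping each loc to its lastmod string (or None).

    Only loc and lastmod are read; other fields such as priority are ignored.
    """
    root = ET.fromstring(xml_bytes)
    entries = {}
    for url in root.findall("sm:url", NS):
        loc = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        if not loc:
            continue  # skip entries without a usable URL
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        entries[loc] = lastmod.strip() if lastmod else None
    return entries


def urls_to_fetch(current, previous):
    """Select URLs that are new or whose lastmod differs from the previous crawl.

    Assumption: an entry without lastmod is always fetched, because there is
    nothing to compare against.
    """
    return [
        loc
        for loc, lastmod in current.items()
        if lastmod is None or previous.get(loc) != lastmod
    ]
```

Here, previous stands for the loc-to-lastmod mapping stored after the last crawl. Passing an empty dict selects every URL, which roughly corresponds to disabling the lastmod check.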

Format: sitemapindex

To list multiple sitemaps, use a sitemap index file. A sitemapindex contains links to one or more urlset sitemaps.

Example file:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>

   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2005-01-01</lastmod>
   </sitemap>
</sitemapindex>

The sitemapindex is the root element. For each sitemap, we use the loc and lastmod values:

  • loc – The URL of the sitemap to retrieve. Note that sitemaps may be gzipped, but do not have to be.
  • lastmod – We only retrieve sitemaps for which lastmod has changed since the previous crawl. This can be disabled for systems that do not reliably update the lastmod property (e.g. in auto-generated sitemap index files).

Any other fields specified are ignored.
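
To make the two file kinds and the optional gzip compression concrete, the following Python sketch reads a downloaded sitemap payload and reports which kind it is. This is a sketch under assumptions, not Pandosearch's code: read_sitemap and decode_sitemap are hypothetical names, and detecting gzip by its two leading magic bytes is one possible approach (a crawler could equally rely on the file extension or HTTP headers).

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_sitemap(raw):
    """Decompress the payload if it is gzipped, otherwise return it unchanged."""
    return gzip.decompress(raw) if raw[:2] == GZIP_MAGIC else raw


def read_sitemap(raw):
    """Return (kind, entries), where kind is 'urlset' or 'sitemapindex'.

    entries is a list of (loc, lastmod) tuples; lastmod may be None.
    """
    root = ET.fromstring(decode_sitemap(raw))
    kind = root.tag.split("}")[-1]  # drop the '{namespace}' prefix
    child = {"urlset": "url", "sitemapindex": "sitemap"}.get(kind)
    if child is None:
        raise ValueError(f"unsupported root element: {kind}")
    ns = {"sm": SITEMAP_NS}
    entries = []
    for node in root.findall(f"sm:{child}", ns):
        loc = (node.findtext("sm:loc", default="", namespaces=ns) or "").strip()
        lastmod = node.findtext("sm:lastmod", default=None, namespaces=ns)
        if loc:
            entries.append((loc, lastmod.strip() if lastmod else None))
    return kind, entries
```

When kind is 'sitemapindex', each loc points to another sitemap that would be fetched and passed through read_sitemap in turn.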

Configuration options

A couple of options are available for sitemaps. These options are usually discussed during initial implementation.

Please contact support if you would like to receive more information on these options.

Sitemap location

We will try to read the location of your sitemap(s) from your robots.txt file.

For example, we will use your sitemap when your robots.txt file contains one or more lines like this:

Sitemap: https://www.example.com/sitemap.xml

Pandosearch can also be configured to use one or more explicit sitemap URLs (e.g. specifically created for Pandosearch indexing only).
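
The robots.txt lookup described above can be sketched in Python as follows. The function name sitemap_urls_from_robots is hypothetical, and matching the field name case-insensitively and stripping "#" comments are assumptions about how a crawler might read the file, not documented Pandosearch behavior.

```python
def sitemap_urls_from_robots(robots_txt):
    """Collect the URLs from every 'Sitemap:' line in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        # Split on the first colon only, so the URL's own "https:" stays intact.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```

Because a robots.txt file may contain several Sitemap lines, the function returns a list rather than a single URL.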

Leading sitemap

By default, sitemaps are used in addition to organic crawling. Organic crawling means that Pandosearch follows the links it finds on each page to other pages, repeating the process until no new pages are found.

A sitemap can also be configured as "leading". If so, the sitemap acts as the single source of truth for Pandosearch. The sitemap must contain all URLs that should be indexed, as any other URLs will not be included.
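
The difference between the two modes can be expressed as a small Python sketch. It is a simplified model: urls_to_index and its leading parameter are hypothetical names chosen for illustration, not actual Pandosearch configuration keys.

```python
def urls_to_index(sitemap_urls, crawled_urls, leading=False):
    """Return the set of URLs eligible for indexing.

    sitemap_urls: URLs listed in the sitemap(s).
    crawled_urls: URLs discovered by following links (organic crawling).
    leading:      if True, the sitemap is the single source of truth.
    """
    if leading:
        # Leading sitemap: URLs that are not in the sitemap are excluded.
        return set(sitemap_urls)
    # Default: the sitemap supplements organic crawling.
    return set(sitemap_urls) | set(crawled_urls)
```

With a leading sitemap, a page that is linked from the site but missing from the sitemap does not end up in the result.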

Using a leading sitemap can be beneficial or not, depending on your specific content and needs.