Including Content

For information on search engine ranking placement, please see writing for search engines.

Why isn't my website in the search engine?

There are a few reasons your website might not be in the search engine index. Please check all of the following before submitting a help request.

For Static HTML Pages...

1. Your site isn't on northwestern.edu

By default, the search engine indexes all URLs matching the pattern *.northwestern.edu/*. If your site is affiliated with Northwestern but hosted on a different domain, send a request to search-help@northwestern.edu asking for its URL to be added to the crawl pattern.
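As a rough illustration of the pattern above, the following Python sketch tests whether a URL would fall under *.northwestern.edu/*. It assumes simple shell-style wildcard matching applied separately to the host and the path; the appliance's actual matching rules may differ, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Default crawl pattern quoted in the text above.
CRAWL_PATTERN = "*.northwestern.edu/*"


def matches_crawl_pattern(url: str, pattern: str = CRAWL_PATTERN) -> bool:
    """Return True if url's host and path match the wildcard pattern.

    Assumption: the part of the pattern before the first "/" applies to the
    host name and the rest applies to the path.
    """
    host_pattern, _, path_pattern = pattern.partition("/")
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = (parts.path or "/")[1:]  # drop the leading slash
    return fnmatchcase(host, host_pattern) and fnmatchcase(path, path_pattern)


if __name__ == "__main__":
    print(matches_crawl_pattern("http://www.northwestern.edu/about/"))
    print(matches_crawl_pattern("http://www.example.org/northwestern/"))
```

A site for which this check fails is a candidate for the crawl pattern request described above.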

2. There are no links to the site or page

If there are no links to your website or webpage, the search engine, just like your website visitors, won't be able to find it. Check to see where links to your page exist using the link: query prefix.

3. There might be a robots exclusion

It's possible that your web server is restricting access to your website via robots.txt rules. You can usually check this by reading the file, if it exists on your web server. For sites hosted on nuinfo, the primary Northwestern server, robots.txt is located at http://www.northwestern.edu/robots.txt. If your website is at another address, replace the host name with your own, e.g. http://www.example.com/robots.txt. If this file does not exist on your web server, you have no corrections to make. If the file does exist and your site's path appears in its list of disallowed directories, edit the file to remove the line containing your site, or contact a server administrator who can help you with this.

This is the most common type of robots block. For more robots exclusion methods to check, please see the excluding content page.
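The robots.txt check described above can also be done programmatically. The sketch below uses Python's standard urllib.robotparser module to derive the robots.txt location for a page and to test a URL against a set of rules. The sample rules and example.com URLs are hypothetical and only illustrate the idea.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def is_allowed(robots_txt: str, page_url: str, user_agent: str = "*") -> bool:
    """Check page_url against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(robots_url_for("http://www.example.com/dept/index.html"))
    print(is_allowed(SAMPLE_ROBOTS, "http://www.example.com/private/a.html"))
    print(is_allowed(SAMPLE_ROBOTS, "http://www.example.com/dept/index.html"))
```

To test a live site, fetch the text at the address returned by robots_url_for and pass it to is_allowed together with the page you expect to be indexed.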

For Dynamic Content (ASP, PHP, ColdFusion, et al.)...

The search appliance's crawler has built-in algorithms to detect when it has hit an infinite loop within a web application. This is often triggered when an application offers rich sorting features that would generate thousands upon thousands of unique views (each a distinct document). This functionality is important for visitors, but it consumes resources on the crawler.

Several web application types are automatically blocked by the search engine. There are still some alternatives to standard crawling that can aid the appliance in indexing the content within such systems.

Feeds

In order to inject your content into the index, you can develop a Feed for it. There is a comprehensive page on Feeds at the Google enterprise website.

Feeds are essentially XML documents that contain lists of URLs to be indexed. In general, the process for developing a feed for the appliance looks something like this:

  1. Create a search-engine-friendly view for records (minimal links, minimal UI magic)
  2. Create a script to generate recent/important content links in XML format
  3. Create a daemon (program) to issue an HTTP request to the Appliance, submitting this XML document
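Step 2 of the list above might be sketched in Python as follows. The element names (gsafeed, header, datasource, feedtype, group, record) and the "metadata-and-url" feed type are assumptions based on the general shape of the appliance's feed format and should be verified against the Feeds documentation. The data source name and record URLs are hypothetical.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def build_feed(datasource: str, urls: list[str]) -> bytes:
    """Build an XML feed document listing the URLs to be indexed.

    Assumption: the element and attribute names below follow the
    appliance's feed format; check them against the Feeds documentation.
    """
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(root, "group")
    for url in urls:
        # One record per search-engine-friendly view (step 1).
        ET.SubElement(group, "record", {"url": url, "mimetype": "text/html"})
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    feed = build_feed(
        "example_source",  # hypothetical data source name
        [
            "http://www.example.com/records/1",
            "http://www.example.com/records/2",
        ],
    )
    print(feed.decode("utf-8"))
```

Submitting the document (step 3) is left out of this sketch, because the request address and form fields are defined by the appliance and are best taken from the Feeds documentation.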

Please consult with us at search-help@northwestern.edu first about the type and volume of content you plan to index.