Northwestern Search Documentation: Content Inclusion
Why isn't my website in the search engine?
There are a few reasons your website might not be in the search engine index. Please check all of the following before submitting a help request.
For Static HTML Pages...
1. Your site isn't on northwestern.edu
By default, the search engine indexes all URLs matching the pattern *.northwestern.edu/*. If your site is affiliated with Northwestern but hosted on a different domain, send a request to search-help@northwestern.edu asking for your URL to be added to the crawl pattern.
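As a rough way to check a URL against the default pattern yourself, here is a minimal Python sketch. It assumes the pattern behaves like a simple glob applied to the hostname and path separately; how the search engine actually interprets the pattern may differ, and the function name is made up for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# The default crawl pattern quoted in this document.
CRAWL_PATTERN = "*.northwestern.edu/*"


def matches_crawl_pattern(url: str, pattern: str = CRAWL_PATTERN) -> bool:
    """Return True if url's host and path match a 'host-glob/path-glob' pattern.

    Assumption: glob semantics, applied to host and path separately so that
    a path segment cannot masquerade as part of the hostname.
    """
    host_pattern, _, path_pattern = pattern.partition("/")
    parts = urlsplit(url)  # the URL must include a scheme such as http://
    host = (parts.hostname or "").lower()
    path = parts.path.lstrip("/")
    return fnmatchcase(host, host_pattern) and fnmatchcase(path, path_pattern or "*")


print(matches_crawl_pattern("http://www.northwestern.edu/about/"))
print(matches_crawl_pattern("http://www.example.com/northwestern.edu/"))
```

Under these assumptions the first URL matches and the second does not, because only its path, not its hostname, contains northwestern.edu.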
2. There are no links to the site or page
If there are no links to your website or webpage, the search engine, just like your website visitors, won't be able to find it. To see which pages link to yours, search using the link: query prefix.
3. There might be a robots.txt exclusion
It's possible that your web server is restricting access to your website via robots.txt rules. You can usually check this by reading the file, if one exists on your web server. For sites hosted on nuinfo, the primary Northwestern server, robots.txt is located at http://www.northwestern.edu/robots.txt. If your website is at another address, the file sits at the root of that host: replace everything before the last slash with your own address, e.g. http://www.example.com/robots.txt. If your site URL appears in this file's list of disallowed directories, you may need to edit the file to remove the line containing your site, or contact a server administrator who can help you with this.
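The check described above can also be scripted. The following is a minimal sketch using Python's standard urllib.robotparser module; the sample rules and the helper names robots_url and is_allowed are invented for illustration and are not taken from any real Northwestern robots.txt file.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Build the robots.txt address for the host that serves page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def is_allowed(robots_lines, page_url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt lines permit crawling page_url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, page_url)


# Hypothetical robots.txt content, used here instead of a live download.
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

print(robots_url("http://www.example.com/dept/page.html"))
print(is_allowed(SAMPLE_RULES, "http://www.example.com/private/report.html"))
print(is_allowed(SAMPLE_RULES, "http://www.example.com/dept/page.html"))
```

To test against a live server instead of sample rules, RobotFileParser can download the file itself: call set_url() with the address from robots_url(), then read(), then can_fetch().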
For Dynamic Content (ASP, PHP, ColdFusion, et al.)...
The search appliance's crawler has built-in algorithms to detect whether it has entered an infinite loop within a web application. This is often triggered when an application offers rich sorting features that generate thousands upon thousands of unique views, each of which counts as a distinct document. This functionality is important for visitors, but it consumes resources on the crawler.
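To see how quickly sorting options multiply, consider this small Python sketch. The parameter names and values (sort, dir, per_page, page) and the base URL are entirely hypothetical; the point is only that every combination yields a distinct URL for the crawler to fetch.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical options offered by a single listing page.
SORT_FIELDS = ["title", "author", "date", "department"]
DIRECTIONS = ["asc", "desc"]
PAGE_SIZES = [10, 25, 50, 100]
PAGES = range(1, 51)  # 50 result pages


def listing_urls(base: str):
    """Yield every distinct URL the listing page can produce."""
    for field, direction, size, page in product(
        SORT_FIELDS, DIRECTIONS, PAGE_SIZES, PAGES
    ):
        query = urlencode(
            {"sort": field, "dir": direction, "per_page": size, "page": page}
        )
        yield f"{base}?{query}"


urls = set(listing_urls("http://www.example.com/records.php"))
print(len(urls))  # 4 * 2 * 4 * 50 = 1600 distinct views of the same records
```

A single page with four modest controls already produces 1,600 crawlable views of the same underlying records.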
Several web application types are automatically blocked by the search engine. There are still some alternatives to standard crawling that can aid the appliance in indexing the content within such systems.
Feeds
To inject your content into the index, you can develop a Feed for it. The protocol and specifics of Feeds are beyond the scope of this document; however, there is a comprehensive page on Feeds at the Google enterprise website.
Feeds are essentially XML documents that contain lists of URLs to be indexed. In general, the process for developing a feed for the appliance looks something like this:
- Create a search-engine-friendly view for records (minimal links, minimal UI magic)
- Create a script to generate links to recent/important content in XML format
- Create a daemon (program) to issue an HTTP request to the Appliance, submitting this XML document.
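The second step above might be sketched in Python as follows. This is only an illustration: the element and attribute names (gsafeed, header, datasource, feedtype, group, record) are assumptions modeled on the general shape of appliance feeds and do not come from this document, and the datasource value and URLs are made up. Check the Feeds documentation for the exact format, and for the submission request in the third step, which is left out here.

```python
import xml.etree.ElementTree as ET


def build_feed(urls, datasource="example_source", feedtype="metadata-and-url"):
    """Return an XML string listing urls for indexing.

    Assumption: the element names below approximate the appliance's feed
    format; verify them against the official Feeds documentation.
    """
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(root, "group")
    for url in urls:
        # ElementTree escapes characters such as & in attribute values.
        ET.SubElement(group, "record", {"url": url, "mimetype": "text/html"})
    return ET.tostring(root, encoding="unicode")


# Hypothetical search-engine-friendly record views (step one).
feed_xml = build_feed(
    [
        "http://www.example.com/records/view.php?id=1",
        "http://www.example.com/records/view.php?id=2&lang=en",
    ]
)
print(feed_xml)
```

Building the document with an XML library rather than string concatenation keeps URLs containing characters such as & well-formed.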
Please consult with us at search-help@northwestern.edu before allocating time to developing a feed. We would like to know in advance about the kind and volume of content you're interested in having indexed.