Crawling and Indexing
It's important to make a distinction between crawling and indexing. Crawling is the process by which new URLs are discovered by following links in pages. Indexing is the process by which content in pages known to exist is read and processed for search.
The NU crawler begins to look for content at the main University page, http://www.northwestern.edu/, and follows links as far as it can, discovering and indexing as many University pages as possible whose URLs match the pattern *.northwestern.edu/*
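The crawler's actual implementation is not public, but the URL scope test it applies can be sketched with Python's standard fnmatch module (the function name here is a hypothetical illustration):

```python
from fnmatch import fnmatch

# The wildcard pattern the crawler uses to decide whether a URL is in scope.
PATTERN = "*.northwestern.edu/*"

def in_crawl_scope(url: str) -> bool:
    # Strip the scheme so the hostname lines up with the pattern's leading "*".
    bare = url.split("://", 1)[-1]
    return fnmatch(bare, PATTERN)

print(in_crawl_scope("http://www.northwestern.edu/about/"))  # True
print(in_crawl_scope("http://www.example.com/page.html"))    # False
```

Any page whose URL fails this kind of test is outside the crawler's scope, no matter how it is linked.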
The crawler runs on a constant schedule, looking for new content as frequently as possible.
How often does the crawler run?
The crawler runs constantly, looking for new URLs as frequently as it can. How often this happens varies from site to site, but at a minimum it should make a complete round trip every day or two.
Is there some way to speed this up?
No. The algorithm that determines how frequently pages are checked for changes in their contents is based on historical record of update frequency. Pages that change infrequently are given a lower priority.
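The appliance's exact scheduling algorithm is not published, but a frequency-based back-off of the kind described above can be sketched as follows (the function name, base interval, and cap are illustrative assumptions, not the appliance's actual values):

```python
from datetime import timedelta

def next_crawl_interval(changes_observed: int, checks_made: int,
                        base: timedelta = timedelta(days=1),
                        cap: timedelta = timedelta(days=30)) -> timedelta:
    """Pages that change on most visits are rechecked near the base interval;
    pages that rarely change are backed off, up to a cap."""
    change_rate = changes_observed / max(checks_made, 1)
    if change_rate == 0:
        return cap  # never seen a change: lowest priority
    return min(base / change_rate, cap)

print(next_crawl_interval(9, 10))   # frequently updated: near the base interval
print(next_crawl_interval(1, 20))   # rarely updated: backed off toward the cap
```

Under a scheme like this, a page's history of updates, not its importance to you, drives how soon it is revisited.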
We understand this isn't always ideal; there are situations in which it's vital to have additions, edits, or deletions to websites reflected in the engine more quickly, such as when important announcements are posted or information must be redacted from the web immediately.
Is there some way to slow this down?
Generally, no. The search appliance attempts to make sure its content is as fresh as possible. It's possible that a misconfiguration of your server or content management system could be misleading the search appliance and other clients about the last time your pages were updated. This can cause unnecessary crawling of your server.
The crawler will send an HTTP request to the server, including an If-Modified-Since HTTP header. HTTP headers are small pieces of information that describe a server request or response; they are not generally visible to web visitors, but they are very important to web browsing software and search engines. The engine sends a request for some-page.html and, along with that URL, asks the server whether the page has been updated since the last time it checked. An example If-Modified-Since line looks like this:
If-Modified-Since: Wed, 19 Oct 2005 10:50:00 GMT
If the page requested has not been updated since the date supplied, the server should respond with code 304 (Not Modified). Otherwise, it should respond 200 (OK) and send the page. In a web browser, the former results in the page being loaded from your hard drive and the latter results in the page being downloaded from the web server. Improper responses to the If-Modified-Since header can waste bandwidth and traffic, since every page and file is sent to the client on every request regardless of whether it has changed. In the case of crawlers, an improper response tells the crawler that the content has been updated even when it has not.
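A conditional request of this kind can be sketched in Python using only the standard library (the URL and date below are illustrative; substitute a page on your own server):

```python
import urllib.error
import urllib.request

def conditional_get(url: str, last_seen: str) -> int:
    """Fetch url with an If-Modified-Since header; return the HTTP status."""
    request = urllib.request.Request(url)
    # Ask the server to send the page only if it changed after this date.
    request.add_header("If-Modified-Since", last_seen)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status  # 200: changed, full page re-sent
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return 304  # unchanged since the date; no body is transferred
        raise

# Example call (requires network; URL is illustrative):
# conditional_get("http://www.northwestern.edu/", "Wed, 19 Oct 2005 10:50:00 GMT")
```

A server that always answers 200, even for unchanged pages, forces the full transfer every time; a correct 304 lets the crawler skip the download entirely.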
Microsoft has a good webmaster toolkit that can tell you whether your site is responding properly to the If-Modified-Since header. Enter a URL in the top box, then select a date in the calendar and click 'TEST'. If your server does not support If-Modified-Since headers, or misrepresents the last time the page you supply was updated, ask your server administrator about enabling proper If-Modified-Since support.
How can I submit my page for crawling?
You should only have to submit your site for crawling if it resides on its own top-level domain (e.g. www.icair.org) and isn't linked from a Northwestern page already in the indexer.
If your site is on its own domain, please let us know about your URL at firstname.lastname@example.org. If your site simply isn't linked from an indexed Northwestern page, we suggest you get linked so that the search engine, and more importantly all of your visitors, can get to your site!
Indexing is the process by which the content on your page is examined and ingested into the "index" -- a database of keywords and associated pages (URLs) containing them.
What content is indexed?
The entire page is stored in the index. Specific components of the page are used by the search algorithm to determine relevance for keyword searches. The easiest way to get your content ranking high for a given term is to include that term in your content! The indexing algorithm ranks pages using three components of the page:
- The text inside the <title> element.
- The text in the <meta> description field, e.g.
<META NAME="Description" CONTENT="Your descriptive sentence or two goes here.">
- The document's body contents.
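A rough sketch of how these three fields can be pulled out of a page, using Python's standard html.parser (the class name and sample page are illustrative, not the appliance's actual code):

```python
from html.parser import HTMLParser

class IndexFieldExtractor(HTMLParser):
    """Collect the three components the ranking algorithm weighs:
    <title> text, the meta description, and the body text."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.body_text = []
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body:
            self.body_text.append(data.strip())

page = """<html><head><title>Bienen School of Music</title>
<meta name="Description" content="Concerts and degree programs.">
</head><body><p>Upcoming events and admissions.</p></body></html>"""

extractor = IndexFieldExtractor()
extractor.feed(page)
print(extractor.title)        # Bienen School of Music
print(extractor.description)  # Concerts and degree programs.
```

If a search term appears in none of these three places, no extractor of this kind will ever associate it with your page.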
For more information on how to improve each of these areas, see writing for search engines.
What about meta tagging and keywords?
The most widely recognized meta tagging scheme is the Dublin Core metadata set. A simple Dublin Core generator tool is available at the NU Library site.
In certain cases, the search appliance has been known to use keywords to solve problems of common misspellings, e.g. "beenen" instead of "bienen".