Northwestern Search Documentation: Content Crawl & Indexing

Crawling and indexing are distinct processes. Crawling is the process by which new URLs are discovered by following links in pages. Indexing is the process by which the content of pages already known to exist is read and processed for search.

Crawling

The NU crawler starts at the main University page, http://www.northwestern.edu/, and follows links as far as it can, discovering and indexing as many University pages as possible whose URLs match the pattern *.northwestern.edu/*.
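As an illustration of the URL pattern above, here is a minimal Python sketch of a scope check. The function name in_crawl_scope is hypothetical, and the sketch assumes a literal reading of *.northwestern.edu/* in which only the host name matters; the crawler's actual matching rules are not documented here.

```python
from urllib.parse import urlparse

def in_crawl_scope(url: str) -> bool:
    """Return True if the URL's host is a subdomain of northwestern.edu.

    Hypothetical helper: a literal reading of *.northwestern.edu/*,
    not the crawler's real implementation.
    """
    host = (urlparse(url).hostname or "").lower()
    # Require a dot before the domain so look-alike hosts are rejected.
    return host.endswith(".northwestern.edu")

print(in_crawl_scope("http://www.northwestern.edu/"))  # host is in scope
print(in_crawl_scope("http://www.icair.org/"))         # host is out of scope
```

Under this reading, a site on its own domain falls outside the pattern even if Northwestern pages link to it.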

The crawler runs continuously, looking for new content as frequently as possible.

How often does the crawler run?

The crawler runs constantly, looking for new URLs as frequently as it can. The exact interval is not known, but a full pass generally seems to take a day or two.

Is there some way to speed this up?

No. The algorithm that determines how frequently a page is checked for changes is based on that page's historical record of update frequency. Pages that change infrequently are given a lower priority.

We understand this isn't always ideal. In some situations it's vital that additions, edits, or deletions show up in the search engine quickly, such as when important announcements are posted or information must be removed from the web right away.
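To make the idea concrete, here is a rough Python sketch of prioritizing pages by how often they have changed in the past. The search appliance's real algorithm is not documented here; the function names, the use of day numbers, and the averaging rule are all assumptions made for illustration.

```python
from statistics import mean

def mean_change_interval(change_days):
    """Average number of days between observed changes.

    Returns infinity when fewer than two changes have been seen,
    so pages with no history sort last. (Illustrative rule only.)
    """
    if len(change_days) < 2:
        return float("inf")
    ordered = sorted(change_days)
    return mean(b - a for a, b in zip(ordered, ordered[1:]))

def recrawl_order(history):
    """Return URLs sorted so frequently changing pages come first."""
    return sorted(history, key=lambda url: mean_change_interval(history[url]))

# Hypothetical history: URL -> day numbers on which a change was observed.
history = {
    "/news/": [0, 1, 2, 3],
    "/about/": [0, 30, 60],
    "/archive/": [5],
}
print(recrawl_order(history))
```

In this sketch a page that has changed daily is revisited before one that changes monthly, which mirrors the lower priority given to infrequently updated pages.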

For this, we encourage you to use the self-service search console.

How can I submit my page for crawling?

You should only have to submit your site for crawling if it resides on its own domain name (e.g. www.icair.org) or isn't linked from a Northwestern page already in the index.

If your site is on its own domain, please let us know about your URL at webmaster@northwestern.edu. If your site simply isn't linked from anywhere, we suggest you get it linked so that the search engine and, more importantly, all of your visitors can get to it!

Indexing

Indexing is the process by which the content on your page is examined and ingested into the "index" -- a database of keywords and associated pages (URLs) containing them.
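The keyword-to-URL structure described above is commonly called an inverted index. The Python sketch below shows the general idea only; the tokenizing rule, the function name build_index, and the sample pages are assumptions, not a description of how the search appliance stores its data.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Split on anything that is not a letter or digit (simplistic tokenizer).
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

# Hypothetical page contents, for illustration only.
pages = {
    "/music/": "School of Music admissions",
    "/admissions/": "Undergraduate admissions information",
}
index = build_index(pages)
print(sorted(index["admissions"]))
```

Looking up a keyword then returns every page known to contain it, which is the starting point for ranking results.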

What content is indexed?

The entire page is stored in the index. Specific components of the page are used by the search algorithm to determine relevance for keyword searches. The easiest way to get your content to rank highly for a given term is to include that term in your content! The indexing algorithm ranks pages from three components of the page:

What about meta tagging and keywords?

The most recognized meta tagging scheme is the Dublin Core metadata element set. A simple Dublin Core generator tool is available at the NU Library site.
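For readers who prefer to write the tags by hand, the Python sketch below produces Dublin Core meta tags using the common convention of a schema.DC link followed by DC.-prefixed meta names. The function name dublin_core_meta and the sample values are placeholders, and this is not the output of the Library's generator tool.

```python
from html import escape

def dublin_core_meta(fields):
    """Render Dublin Core elements as HTML <meta> tags for a page's <head>."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in fields.items():
        # escape() also escapes quotes, so values are safe inside attributes.
        lines.append(
            f'<meta name="DC.{escape(element)}" content="{escape(value)}">'
        )
    return "\n".join(lines)

# Placeholder values for illustration.
print(dublin_core_meta({
    "title": "Content Crawl & Indexing",
    "creator": "Northwestern University",
}))
```

The resulting lines go inside the head element of the page being described.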

In certain cases, the search appliance has been known to use keywords to handle common misspellings, e.g. "beenen" instead of "bienen".