Website Search Engines Questions
Ten Things to Know About Google

1. The database that Google licenses to Yahoo! [http://google.yahoo.com] is not the same size: it's smaller than the Google.com database. It does not contain links to cached versions of pages. This database is also used to supply "fall-through" content (material not in Yahoo's own database). It is often found listed as "Web page" content.

2. Google utilizes the Open Directory Project database as its Web Directory [http://directory.google.com].

3. You can search stop words by placing a + in front of the word (ex. "+To +Be +Or Not +To +Be").

4. At the present time the Google database is refreshed about once every month.

5. You can limit your search to only .pdf files by using the syntax filetype:pdf.

6. Google is the only major search engine to crawl Adobe Acrobat .pdf files.

7. If you are a frequent Google searcher, save time by using the Google Toolbar [http://toolbar.google.com] and Google Buttons [http://www.google.com/options/buttons.html].

8. A Boolean "OR" is available with Google. For it to function, capitalize the OR.

9. Google only crawls and makes searchable the first 110 k of a page. Long documents may have substantial content invisible to Google.

10. Entering a U.S. street address into the query box will return a link to a map of that address location. Typing in a person or business name, city, and state will also run the query against the Google phone directory. Several other combinations will also query the phone directory service, including typing in the area code and number to run a reverse search [http://www.google.com/help/features.html#wp].
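
The query syntax in items 3, 5, and 8 can be combined in a single search. The following is a minimal Python sketch that assembles such a query string. The function name and its parameters are hypothetical, invented for illustration; only the +, OR, and filetype: syntax comes from the list above.

```python
def build_google_query(terms=(), stop_words=(), any_of=(), filetype=None):
    """Assemble a query string from the syntax in items 3, 5, and 8.

    All parameter names are illustrative assumptions, not part of any API.
    """
    parts = []
    # Item 3: a leading + forces a stop word to be searched.
    parts.extend("+" + word for word in stop_words)
    # Ordinary search terms are passed through unchanged.
    parts.extend(terms)
    # Item 8: OR only works as a Boolean operator when capitalized.
    if any_of:
        parts.append(" OR ".join(any_of))
    # Item 5: filetype:pdf limits the results to .pdf files.
    if filetype:
        parts.append("filetype:" + filetype)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_google_query(
        terms=["hamlet"],
        stop_words=["to", "be"],
        any_of=["soliloquy", "monologue"],
        filetype="pdf",
    ))
```

The resulting string is what you would type into the query box by hand; the sketch only shows how the pieces fit together.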
 

Ten Things to Know About AllTheWeb

1. AllTheWeb licenses its database to Lycos. The identical database is searched and makes up some of the content on a Lycos results page.

2. Unlike Google and AltaVista, this search engine does not have a limit on the amount of content crawled on a Web page.

3. AllTheWeb indexes every word. Words traditionally considered as "stop words" are searchable.

4. AllTheWeb does not permit the use of Boolean operators.

5. If plus and/or minus signs are not used, AllTheWeb implies a plus sign in front of each term or phrase. This results in an implied "anding" of terms.

6. AllTheWeb is now promising a complete refresh of its database every 9-12 days.

7. AllTheWeb permits syntax to be used directly from the "basic" search page to limit a query. See http://www.alltheweb.com/help/basic.html#special.

8. A query to the AllTheWeb text database simultaneously runs the search in the AllTheWeb Image, Video, MP3, and FTP databases. If it finds anything, these results are linked on the right side of the results page.

9. AllTheWeb offers a search engine dedicated to Mobile Web content [http://mobile.alltheweb.com].

10. Fast Search and Transfer (FAST), the company behind AllTheWeb, has deployed its software to power the Scirus science search engine from Elsevier.
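
The implied plus sign described in item 5 can be pictured as a rewrite of the query before it runs. The sketch below is only an illustration of that rule, not AllTheWeb's actual implementation; the function name and the way the query is split into terms and quoted phrases are assumptions.

```python
import re

# A token is either a quoted phrase (optionally signed) or a run of non-space characters.
TOKEN = re.compile(r'[+-]?"[^"]*"|\S+')


def apply_implied_plus(query):
    """Prefix + to every term or phrase that carries no + or - sign."""
    rewritten = []
    for token in TOKEN.findall(query):
        if token[0] in "+-":
            rewritten.append(token)        # the searcher already chose a sign
        else:
            rewritten.append("+" + token)  # implied plus, i.e. an implied AND
    return " ".join(rewritten)


if __name__ == "__main__":
    print(apply_implied_plus('search "invisible web" -spam +engines'))
```

Because every unsigned term ends up required, the terms are effectively "anded" together.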
 

Ten Things to Know About AltaVista

1. AltaVista is the only major search engine that allows a searcher to use the proximity operator, NEAR (in simple search) or near (in advanced search). Using this operator finds terms within 10 words of each other in either direction.

2. AltaVista indexes only the first 100 k of text on a page.

3. An asterisk (*) can be used in a phrase to represent an entire word. (Ex. "One small step for man, one giant * for mankind")

4. AltaVista News [http://news.altavista.com] is "powered" by Moreover. This continuous feed of material can be searched using AltaVista syntax.

5. The use of the "sort by" box on the AltaVista Advanced interface allows you to give certain words or phrases a higher relevancy weighting.

6. Caveat: If you use Advanced Search, make sure to place some term or terms in the Sort-By box; otherwise, results return in completely random order.

7. AltaVista's directory comes from Looksmart.

8. AltaVista's advanced search does not allow for the use of + and - signs.

9. In "simple" mode, entering multiple terms without syntax results in an "implied" OR. In advanced mode, multiple terms are treated as a phrase.

10. AltaVista software powers the Health Resources and Services Administration (U.S. government) search engine. This means that all AltaVista syntax can be utilized there. This site also illustrates AltaVista's capability of indexing full-text .pdf documents at the site-specific and intranet level [http://search.hrsa.gov].
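
To make the proximity rule in item 1 concrete, here is a small Python sketch of a NEAR-style test. It is not AltaVista's code: the function name, the simple word splitting, and the exact way the 10-word distance is counted are all assumptions made for illustration.

```python
import re


def near(text, term_a, term_b, max_distance=10):
    """Return True if the two terms occur within max_distance words of each
    other, in either direction (distance counted in word positions)."""
    words = re.findall(r"\w+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    # Compare every occurrence of one term with every occurrence of the other.
    return any(abs(a - b) <= max_distance
               for a in positions_a
               for b in positions_b)


if __name__ == "__main__":
    sample = "Search engines crawl and index pages found on the open Web"
    print(near(sample, "search", "web"))
    print(near(sample, "web", "search"))
```

Swapping the two terms gives the same answer, which is what "in either direction" means in practice.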
 

Ten Things to Know About MSN Search

1. MSN (Microsoft Network) Search is "powered" by an Inktomi database. Remember that Inktomi licenses its database to many search sites. Each site gets a different "flavor" of the total database.

2. The MSN Advanced Search interface offers numerous limiting options via fill-in boxes and pull-down menus [http://search.msn.com/advanced.asp].

3. The Advanced Search interface permits limiting results to pages at a certain depth in a site. For example, choosing Depth 3 restricts the search to pages no more than three directories deep within a site [e.g., http://www.testsearch.com/Directory1/Directory2/Directory3/].

4. MSN Search allows use of the asterisk (*) as a truncation symbol.

5. According to the most current Search Engine Showdown rankings, MSN Search has the largest database of any Inktomi partner.

6. The directory portion of MSN search is powered by the Looksmart database.

7. On the Advanced Search interface, checking the "Acrobat" box will retrieve pages with links to pages that contain .pdf files. It does not search content "inside" these files.

8. Greg Notess points out that the same syntax available to limit Hotbot will also work with MSN Search [http://hotbot.lycos.com/help/tips/search_features.asp].

9. Danny Sullivan notes that MSN also employs human editors to "hand-pick" key sites in the Web Directory and Featured Link sections of the site. Although most of the time the "Featured Links" represent major MSN advertisers, editors can add other content.

10. Selecting and searching under the MSN "News Search" tab returns results predominantly from MSNBC.
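
The depth limit in item 3 amounts to counting the directories in a page's URL. The Python sketch below shows one plausible way to count them. How MSN Search actually measures depth is not documented here, so the function names and the counting rule (directories only, with a trailing file name ignored) are assumptions.

```python
from urllib.parse import urlparse


def directory_depth(url):
    """Count the directories in a URL's path."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    # Assumption: a path that does not end in "/" names a file,
    # and the file itself is not counted as a directory.
    if segments and not path.endswith("/"):
        segments.pop()
    return len(segments)


def within_depth(url, max_depth):
    """True if the page sits no more than max_depth directories deep."""
    return directory_depth(url) <= max_depth


if __name__ == "__main__":
    url = "http://www.testsearch.com/Directory1/Directory2/Directory3/"
    print(directory_depth(url), within_depth(url, 3))
```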
 

Ten Things to Know About Northern Light

1. Make sure to study the Northern Light "Power" search page. It provides many limiting options without requiring knowledge of any syntax [http://nlresearch.northernlight.com/power_research.html].

2. Instead of entering http://www.northernlight.com, use http://www.nlresearch.com to go straight to the Northern Light Research site. This site, aimed at the enterprise market (but available to any searcher), provides access to several databases not available from the main URL. Most of these resources are fee-based. They include EIU Search and market research content from FIND/SVP and MarkIntel.

3. Northern Light provides FREE full-text access to a database of continuously updated news content from 56 newswires. Material stays in this database, available for free access, for 2 weeks. Then the content moves to the Northern Light Special Collection database.

4. Northern Light's Special Editions are subject-specific portals that combine material from the "open Web" and NL's proprietary databases. Topics of Special Editions include XML, managed care, and electronic commerce.

5. The Northern Light Special Collection currently contains content (fee-based, pay-per-document) from over 7,100 sources. A catalog of these publications is available at http://nlresearch.northernlight.com/docs/specoll_help_catlook.html.

6. Northern Light allows the use of Boolean operators and + and - signs.

7. Multiple truncation symbols can be used in a query. Northern Light has two truncation symbols: the asterisk (*) for multiple letters and the percent symbol (%) for single or absent letters, e.g., medieval/mediaeval.

8. In addition to the limiting capabilities of the "Power" search page, NL has several prefixes available for field searching. These include text: and pub:. (This last prefix allows searching in a specific Special Collection publication title.) You can find a complete list at
http://nlresearch.northernlight.com/docs/search_help_quickref.html.

9. Northern Light's free "Alerts" feature is one resource you must know about. This feature allows you to set up search strategies in ANY/ALL of the NL databases and have those strategies searched up to three times daily. If any new material hits on the strategy, results will be delivered to you via e-mail. I use this tool to bring me a customized feed of news via the NL News Search database. Remember, the full-text content is free to access for 2 weeks.

10. Northern Light's "Geo Search" provides an opportunity to search the Web with keywords and U.S. and Canadian address information. Results also get the benefit of NL's organization with its "custom folders."
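
The two truncation symbols in item 7 behave much like simple wildcards. As a rough illustration, the Python sketch below translates a pattern using * and % into a regular expression. This is an assumption-laden model, not Northern Light's implementation: the function names are hypothetical, and it assumes * may stand for any run of letters (including none) while % stands for at most one letter.

```python
import re


def truncation_to_regex(pattern):
    """Translate * (multiple letters) and % (single or absent letter)
    into an equivalent regular expression."""
    pieces = []
    for ch in pattern:
        if ch == "*":
            pieces.append(r"\w*")   # assumed: zero or more letters
        elif ch == "%":
            pieces.append(r"\w?")   # a single letter, or none at all
        else:
            pieces.append(re.escape(ch))
    return "".join(pieces)


def matches(pattern, word):
    """True if the whole word matches the truncation pattern."""
    return re.fullmatch(truncation_to_regex(pattern), word, re.IGNORECASE) is not None


if __name__ == "__main__":
    for word in ("medieval", "mediaeval"):
        print(word, matches("medi%eval", word))
```

Under these assumptions, a single pattern such as medi%eval covers both spellings mentioned in item 7.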
 

Ontologies, Controlled Vocabularies, XML, and Web Search Engines
I am very excited to see that controlled vocabularies and the building of ontologies have come into vogue.

Some of this "hipness" has been caused by the promise and excitement surrounding XML (eXtensible Markup Language). However, I am not sure if the coming of XML will help the general-purpose search engine, though it should clearly help specialized, focused, and Invisible Web engines become much more useful resources.

Why the hesitation?

The general-purpose engines, as we know and love them today, hypothetically index each page: massive amounts of data coming from just about anyone who wants to produce Web content and put it on a publicly accessible server.

The problem for implementation of a controlled vocabulary with this material is really one of creation. Who would create it? Who would maintain it? Who would do the cataloging? Would entire sites be cataloged at the page level or only a specific page (the top page)? Who would manage such a project? Where would the money come from?

Controlled vocabularies and XML show a great deal of promise for certain types of search engines because these types of engines can much more easily create and enforce a set of agreed-upon standards. Many issues would need resolution before we could apply controlled vocabularies to make searching the massive amount of material on the open Web more effective.
 

The Future: 
New Tools on the Way
When you learn about new search tools and share that knowledge with others, you not only improve your own searching, but you help to make a better future for all searchers.

Here are some new search products that show a lot of promise, a few more potential "quick hits." With the vulnerability of the Internet industry of late, let's hope these products survive. Even if the actual companies do not survive, the technology is still worth knowing about. Have fun!!!

Three New General Purpose Search Engines
Competition for Google?


A New Image Search Tool


Real-Time Search
Patented technology to search resources updated in real-time.


Natural Language Search Technology
This product is getting a lot of attention.


Now let's see if you've learned your lessons. How long will it take before you've tried all these new promising sites out? The test clock starts...now!
 

This Article Contains Inaccuracies: 
Essential Reading
In the time it takes this article to move from the author to the editor to the publisher to the printer to you, undoubtedly something mentioned in this article will have changed. Some feature will have appeared, another vanished. The working searcher must simply make a policy of staying on top of those changes.

Those of you who need to keep current on the Web search world should monitor the following sites as often as possible. All these sites are free and most offer free e-mail newsletters and updates.

SearchDay
http://www.searchenginewatch.com/searchday/
Written by Chris Sherman. Daily updates.

Search Engine Watch
http://www.searchenginewatch.com
A resource-rich site that offers a free monthly newsletter.

Search Engine Showdown
http://www.searchengineshowdown.com
Librarian Greg Notess's site. Updated on a regular basis. Greg also manages the Search-L list.

ResearchBuzz
http://www.researchbuzz.com
Written and compiled by Tara Calishain. Daily updates.

TVC (The Virtual Chase) Alert
http://www.thevirtualchase.com
Written and compiled by Genie Tyburski. Daily updates.

The Virtual Acquisition Shelf and News Desk
http://resourceshelf.blogspot.com
Compiled by Gary Price. Daily updates.

Free Pint
http://www.freepint.com
Fortnightly newsletter edited by Will Hann. Also offers Web discussion boards.

News Breaks from Info Today
http://www.infotoday.com/newsbreaks/
General information industry coverage of breaking news that often features news of the Web search world.