All search engines use the same basic technology: "Spiders" automatically roam the Web looking at the contents of every Web site they have the resources to find, making copies of at least some of the contents. Indexing tools digest this gathered material so that when you conduct a search, relevancy ranking software can prioritize the sites. Although the underlying technology is similar, each search engine implements it differently.
SPIDER TECHNOLOGY
Spider (a.k.a. robot or ’bot) software perpetually crawls from site
to site, taking a complete measure of the Web. Once it finds a new site
or one that has changed, it sends material back to the search engine. While
some spiders grab every site they find, others prioritize their efforts
by determining a site’s popularity or how frequently it has changed. The
assumption here is that popular and frequently changed sites are more likely
to be the most useful.
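To make the selective approach concrete, here is a minimal sketch of how a spider might decide which sites to visit first. The scoring formula, the site names, and the popularity and change figures are all assumptions made for illustration; no engine reviewed here has published how it actually weighs these factors.

```python
import heapq

def crawl_order(sites):
    """Return site names in the order a selective spider might visit them.

    `sites` maps a site name to (popularity, changes_per_month). Both
    numbers and the scoring formula below are illustrative assumptions.
    """
    heap = []
    for name, (popularity, changes_per_month) in sites.items():
        score = popularity * (1 + changes_per_month)
        # heapq is a min-heap, so negate the score to pop the best site first.
        heapq.heappush(heap, (-score, name))
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order

# Hypothetical sites: (popularity, changes per month)
sites = {
    "news.example": (80, 30),
    "archive.example": (40, 0),
    "hobby.example": (10, 2),
}
print(crawl_order(sites))
```

An all-inclusive spider would simply skip the scoring step and visit every site it finds.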
Deciding whether to be selective or all-inclusive is tricky. Say an undiscriminating spider finds a site that hasn’t been updated in two years and sends back 60 pages for indexing. This could clog your search results lists with time-wasting, out-of-date pages. On the other hand, old information isn’t necessarily bad; there might be valuable data hidden away on that old site.
INDEXING
It would be untenable for search engines to examine all Web sites every
time you conducted a search. Instead, search engines use indexes, which
list the words contained in as many as 100 million Web pages, as well as
other information about the sites. Each item listed in the index points
to the resource in which it is located. When you run a search, the engine
consults its index and displays a list of all sites that match your query—together
with links to those sites.
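The idea of an index that points from each word back to the pages containing it can be sketched in a few lines. This is a toy version, assuming a handful of made-up pages held in memory; real engines store far more information per entry and work at a vastly larger scale.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word in the query, sorted."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)

# Hypothetical pages standing in for what a spider sent back.
pages = {
    "http://a.example/": "Bicycle repair and bicycle racing",
    "http://b.example/": "Helmet reviews for bicycle racing",
    "http://c.example/": "Cherry torte recipes",
}
index = build_index(pages)
print(search(index, "bicycle racing"))
```

The search never touches the pages themselves; it only consults the index, which is why a query can be answered quickly.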
Indexing technology isn’t new, but each search engine has its own implementation, primarily at the spider stage. Some engines, such as AltaVista, Excite, and Infoseek, have their spiders send back every word in a Web site for indexing. Others return only a portion of the text in a Web site (just the first half, for example). As with spidering, indexing more data may either waste your time with nonproductive results or unearth nuggets of information hidden deeply in obscure sites.
RELEVANCE RANKING
Let’s say that our broad search for "bicycle" nets 25,000 Web pages.
Obviously, that’s too many sites for you to wade through in order to find
the specific information you want. To address this problem, search engines
try to determine which sites are most relevant to you. All engines display
the most relevant sites at the top of the search results list. They also
provide additional cues, such as a numerical ranking.
Again, search engines use different techniques for determining relevancy. One common technique is to count the number of times the search term appears. The logic is that if the word "bicycle" appears repeatedly, it’s more likely to be on that topic than a page with just one instance of the word.
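The counting technique is simple enough to sketch. The pages below are invented, and the function names are ours; this shows only the general idea of ranking by how often a term appears, not any particular engine's formula.

```python
import re

def count_term(text, term):
    """Count whole-word occurrences of `term` in `text`, ignoring case."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words.count(term.lower())

def rank_by_count(pages, term):
    """Return URLs that mention the term, most mentions first."""
    scored = [(count_term(text, term), url) for url, text in pages.items()]
    scored = [(count, url) for count, url in scored if count > 0]
    # Sort by count (descending), then by URL so ties come out in a fixed order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]

# Hypothetical pages for illustration.
pages = {
    "a.example": "Bicycle shop: bicycle sales, bicycle repair",
    "b.example": "I rode my bicycle once",
    "c.example": "Cherry torte recipes",
}
print(rank_by_count(pages, "bicycle"))
```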
Some search engines weigh how often the search term appears in the first n words of the document, the reasoning being that documents are more likely to be relevant if search terms appear higher. If your query includes multiple words, the search engine may consider how closely together those terms appear.
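Both of these signals can be measured with very little code. The sketch below is an assumption about how such measurements might be taken, with a made-up sentence as the document; how an engine turns the numbers into a final ranking is not something the vendors disclose.

```python
import re

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def early_hits(text, term, n=10):
    """Count occurrences of `term` within the first n words of the text."""
    return tokenize(text)[:n].count(term.lower())

def min_gap(text, term_a, term_b):
    """Smallest distance in words between the two terms, or None if either is absent."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

# Hypothetical document text.
text = "Bicycle helmet laws vary by state, and racing rules differ too"
print(early_hits(text, "bicycle", n=3))
print(min_gap(text, "helmet", "racing"))
```

A ranking routine could then favor documents with more early hits and smaller gaps between query terms.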
FLESHING OUT FINDS
Besides relevancy rankings, all engines provide at least some textual
description of found sites. These are essential for determining whether
a site meets your needs. Some sites create the descriptions by using the
first n characters of the document, which sometimes results in the display
of meaningless information such as menu options. Others, such as Excite
and Lycos, use their own technology to extract key words and phrases that
describe the site. These techniques employ linguistic analysis to ferret
out important words.
Still other sites use the HTML <META> description tag in which the site developer describes the site. Where developers are conscientious, this is useful. If the description is missing or poorly written, the citation in the search results is not useful.
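Two of these approaches, the description tag and the first n characters, are easy to sketch together. The example below uses Python's standard html.parser module; the sample pages, the class name, and the choice to prefer the tag and fall back to leading text are our assumptions, not a description of any reviewed engine.

```python
from html.parser import HTMLParser

class PageScanner(HTMLParser):
    """Collect the META description (if any) and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.description = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")

    def handle_data(self, data):
        self.text_parts.append(data)

def describe(html, n=200):
    """Prefer the developer's META description; otherwise use the first n characters."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    if scanner.description and scanner.description.strip():
        return scanner.description.strip()
    text = " ".join(" ".join(scanner.text_parts).split())
    return text[:n]

# Hypothetical pages: one with a description tag, one without.
with_meta = ('<html><head><meta name="description" content="Bike repair tips.">'
             '</head><body><p>Home | About | Contact</p></body></html>')
without_meta = ('<html><body><p>Home | About | Contact</p>'
                '<p>Welcome to our shop.</p></body></html>')
print(describe(with_meta))
print(describe(without_meta, n=22))
```

The second page shows the weakness noted above: with no description tag, the first characters on the page are menu options rather than anything about the site.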
ENGINES OF CHOICE
The search engine business moves quickly. For example, WebCrawler (http://www.WebCrawler.com)
was one of the first search engines to appear and we included it in our
last roundup. It’s still around, but it’s no longer competitive so we left
it out. Moreover, new search engines seemingly appear every week. However,
we didn’t include new search sites unless they offered something unique
and useful, or unless their capabilities compared well with the established
search engines.
The six sites we review are the top dogs: the ones that give the best overall results on a variety of searches. They’re not equal: Some find more information than others, and some offer more non-search services. But each has something valuable to offer.
ALTA VISTA http://www.altavista.digital.com
AltaVista is the monarch of brute-force searches. It recently released
a new index that digests 100 million pages, roughly twice as many as its
chief competitors. Test results show AltaVista consistently found more
Web documents than HotBot, its closest rival.
AltaVista’s interface, which formerly was a bit of a garbled mess, is much improved (though still not quite as easy to use as HotBot’s). The home page now contains only a search form, an ad, and standard menu options. Similarly, its search results screens are far less cluttered and easier to navigate than before. Its relevancy rankings proved reliable in our tests, though its site descriptions often weren’t useful.
AltaVista is trying some innovative approaches for refining searches. The most successful of these attempts occurs after you search, then click the Refine button. A series of words gleaned from found documents appears; you can elect to exclude or "require" each of these associated terms in a subsequent search. Among the refining words for our search on "bicycle" were "racing" and "helmet." Refining the search with those words helped find pages that discussed issues related to racers wearing helmets.
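In effect, refining this way filters the found pages by words you require and words you exclude. Here is a minimal sketch of that filter, assuming invented pages and a function name of our own; it is not AltaVista's code, and it ignores how the suggested words are chosen in the first place.

```python
import re

def refine(pages, require=(), exclude=()):
    """Keep pages containing every required word and none of the excluded words."""
    kept = []
    for url, text in pages.items():
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        has_required = all(w.lower() in words for w in require)
        has_excluded = any(w.lower() in words for w in exclude)
        if has_required and not has_excluded:
            kept.append(url)
    return sorted(kept)

# Hypothetical results of a broad search for "bicycle".
pages = {
    "a.example": "Bicycle racing helmet rules",
    "b.example": "Bicycle racing results",
    "c.example": "Bicycle helmet sale for kids",
}
print(refine(pages, require=["racing", "helmet"]))
print(refine(pages, require=["helmet"], exclude=["racing"]))
```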
There’s also a flow chart that graphically summarizes search results, letting you choose among words associated with the top-level key words found at the site and include these in subsequent, refining searches. Overall, we found this feature less useful as it was somewhat time-consuming and imprecise.
AltaVista has added an advanced search page in which you can frame complex queries with Boolean operators. You can perform proximity searches as well as search for pages changed within a specific timeframe. Another unique feature allows you to tell AltaVista to attach added importance to specific words when creating its relevancy rankings. However, you still must master its abstruse search syntax to create complex searches.
AltaVista’s usability has improved to the point that it’s now on a par with that of most other search engines—although, again, it still doesn’t match HotBot’s friendly forms-based approach. Nonetheless, AltaVista has clawed its way back to the top spot when measuring raw search power. If AltaVista can’t find it, there’s a high likelihood that it’s not on the Web.
HOTBOT http://www.hotbot.com
HotBot is a powerful and easy-to-use site that consistently came in
second behind AltaVista in our test searches. Most notably, HotBot claims
to reindex its approximately 54 million pages every two weeks! That proved
out in our tests: Sites that had been updated in the last 10 to 14 days
consistently turned up in our HotBot searches but not in searches of the
other engines.
Moreover, HotBot’s advanced querying capabilities are, along with AltaVista’s, the strongest in the group. It will return exact matches to your query or near matches. You can limit your search to specific domains (such as .com or .org) and to specific geographic locations. It can even find items embedded in Web pages like ActiveX controls, Java applets, and specific types of images.
Our favorite HotBot feature is a search modifier that searches only for pages that have changed within a specified period of time, ensuring timely results. You also can search a site to a specific depth (three pages deep, for example). In addition, you can save searches for later reuse.
HotBot’s customizable forms-based interface is, in our view, the easiest
to use of any search engine. You type your search term and then select
how to modify it from a drop-down list. You can click on icons to display
additional forms in which you apply advanced searching techniques such
as querying by domain.
HotBot’s relevancy ranking is generally good but not the best. In our
tests, it sometimes wasn’t clear why certain pages turned up at the top
of the list. Its site descriptions generally were useful, but not always.
HotBot pulls its descriptions from the <META> description tag or, when
that isn’t present, from the top headings. This sometimes results in descriptions
that are gibberish.
For now, HotBot claims the silver medal for power searching. Its directories, which are due for some work, proved anemic in our tests. But when it comes to finesse, HotBot—and its easy-to-use forms for framing searches—has no equal.
EXCITE http://www.excite.com
Excite’s 50 million indexed pages match the numbers found at other
leading search engines, but its search results were consistently in the
bottom half of the group. Nonetheless, Excite still offers a lot. For instance,
we appreciated its ability to search for concepts: A query for "martial
arts" finds sites about kick-boxing and karate even if the original search
term isn’t in the page. After a search is done, Excite suggests words for
narrowing the query. A search for sites about the Zen poet Basho suggested
we also add the search term "haiku," an astute suggestion. Unfortunately,
the feature didn’t always work this well. After conducting a broad search
for "bicycle," among the words it suggested adding to refine the search
were bicyclists, bicycles, bike, and biking.
Power searchers will appreciate the appropriately named Power Search
page, a forms-based aid that lets you quickly create Boolean searches.
Like Infoseek, Excite also does an excellent job of helping you after you
conduct a search. Besides its suggestions for further narrowing the search,
if you find a page that appeals to you, clicking on a link automatically
finds more pages that are similar to it.
We found Excite’s relevancy rankings to be solidly accurate, although we disliked its site summaries. Excite draws its summaries from text within the page, but not necessarily from introductory text. The descriptions often mirrored the second, third, or fourth sentence on a page, which often weren’t descriptive.
Excite has a directory of 140,000 Yahoo-style listings, which pales
compared to Infoseek’s 500,000 directory listings. And it doesn’t offer
much brute search power. However, it does make searching—and refining searches—simple,
and its excellent relevancy ranking helps you home in on useful information.
INFOSEEK http://www.infoseek.com
Infoseek is both broad and deep. It was a close third to AltaVista and
HotBot in overall search power, and it is chock-full of amenities, such
as an extensive directory. Infoseek claims approximately 60 million indexed
pages, and our tests confirmed that it typically found almost as many pages
as HotBot and AltaVista. You can use its natural language query capabilities,
but for greater precision, you must learn its somewhat complex advanced
search syntax.
If you do take the trouble, you’ll be amply rewarded. You can, for example, search not just for words but also for sites with links to specific URLs. This enables you to perform tricks like searching for all sites with links to your site.
Previously cluttered and confusing, Infoseek’s interface is now well organized and easy to navigate. Infoseek also provides the best set of post-search options. To the left of the search results is a column with three headings. Best Bets lists interesting sites that closely match your search criteria. The Related News heading displays recent news stories about your search term. Related Topics lists similar topics. This was sometimes useful, sometimes not: A search for "cherry torte" listed "desserts" and "chocolate" as related topics, but a search for "meaning of life" found "Renting vs. buying" and "Baseball in Japan" as related topics.
In addition, you can easily refine your searches by specifying new criteria and searching again within the initial results set. For instance, our search for "bicycle" turned up more than 126,000 hits. A subsequent search within that group for "tire repair" found more than 3,000 Web sites and Usenet postings about that subject.
Infoseek doesn’t have quite the search muscle of HotBot or AltaVista,
and its descriptions of found sites often were gibberish compared to Northern
Light’s and HotBot’s. However, Infoseek’s myriad other services, such as
its generous Web directory, access to the news, and its yellow and white
pages directories, make it the strongest choice for those who prefer one-stop
shopping.
LYCOS http://www.lycos.com
Lycos isn’t the most powerful search engine, but it is the most highly
customizable—and it does a solid job of combining Web search and directory
services. Its customization capabilities let you fine-tune how its relevancy
ranking works. You access this feature from Lycos Pro (the advanced search
interface). You can set it, for instance, to rank documents in which search
terms appear in the title more highly than other documents. Besides tweaking
the relevancy ranking, you can conduct the actual search from Lycos Pro.
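To see why adjusting such a setting changes the results, consider this small sketch of a scoring function with a tunable title weight. The weights, the documents, and the formula are hypothetical; Lycos does not publish its ranking method, so treat this only as an illustration of the principle.

```python
def score(doc, term, title_weight=3.0, body_weight=1.0):
    """Score a document by term hits, counting title hits more heavily.

    The default weights are arbitrary illustrative values.
    """
    term = term.lower()
    title_hits = doc["title"].lower().split().count(term)
    body_hits = doc["body"].lower().split().count(term)
    return title_weight * title_hits + body_weight * body_hits

# Hypothetical documents.
doc_a = {"title": "Bicycle Repair", "body": "Fix a flat tire"}
doc_b = {"title": "Weekend Notes", "body": "my bicycle and my other bicycle"}

# With titles weighted heavily, doc_a outranks doc_b.
print(score(doc_a, "bicycle"), score(doc_b, "bicycle"))
# With titles weighted the same as body text, the order flips.
print(score(doc_a, "bicycle", title_weight=1.0),
      score(doc_b, "bicycle", title_weight=1.0))
```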
Lycos also provides an extensive number of ways to create and modify searches, including proximity searches and setting the order in which search terms must appear in the documents. You can perform natural-language queries such as, "What is the meaning of life" (which, happily, turned up 43,668 responses). However, unlike the other very customizable search engine reviewed, HotBot, Lycos doesn’t let you search based on when pages were last modified.
We found Lycos’ relevancy ranking consistently to be the strongest in the group. However, its method of extracting descriptions from text in the documents often failed to tell us clearly what the site contained.
Despite its flexibility in modifying searches and results, Lycos didn’t find as many documents as most of the other search engines. Lycos boasts it has indexed 100 million separate URLs and that it rebuilds its index every two weeks. URLs aren’t pages, however, and Lycos failed to match HotBot’s ability to find recently updated information.
Nonetheless, we like Lycos because of its useful mix of features. Its staff of reviewers has specified a number of Web avenues as "Top 5 percent sites." Not only can you read and enjoy the reviews themselves, you can run searches strictly within the reviewed sites if you wish. Lycos also has an extensive site directory, as well as news, weather, sports, and the ability to find businesses and people. Power searchers will probably prefer other services, but Lycos still is attractive if you like many Web services in one place.
NORTHERN LIGHT http://www.northernlight.com
Northern Light is the newest kid on the block and should quickly gain
popularity. Its search prowess already is strong and its unique features
make it more useful for serious researchers than the other search sites.
Northern Light had been available to the public for only a month when we
tested it, yet its results were solidly in the middle of the pack, with
particularly strong results on searches for obscure sites. A spokesman
said search results would improve over the coming months as the number
of indexed pages climbs from 38 million to 50 million. The quality of its relevancy
ranking and site descriptions was excellent.
Besides serving as a Web search engine, Northern Light also aggregates and serves the content of about 1,800 publications, many of which aren’t available elsewhere on the Web. You can search the entire Web, its special collection of publications, or both at once. For items found in the special collection, Northern Light provides a free article synopsis. Viewing an entire article costs as much as four dollars, with a typical charge being a dollar.
Northern Light’s unique custom search folders are an often-successful attempt to categorize found documents. Northern Light places found documents into categories and subcategories that it displays as an outline. The feature worked best when the search terms were ambiguous. For instance, we performed a search for "web spider" and Northern Light returned articles about both arachnids and the technology used by Internet search engines.
Custom search folders made it obvious which were which and thus saved a lot of time. This system isn’t infallible, though: One of the initial subgroups to our search for "bicycle" was "rhythm & blues." Still, Northern Light is onto something significant. We found many of the extra-cost publications to be specialized enough to be truly useful. For example, we found excellent articles about forest management that weren’t available elsewhere on the Web. Combined with its already strong search engine, this makes Northern Light an excellent choice for serious information miners.