CHOOSING A WEB SEARCH ENGINE

To get gold, you must sift through mounds of raw ore. To find valuable nuggets of information on the Internet, you’ve got to sift through an almost unfathomable number of Web pages—which explains the popularity of Web search services. These sites are information refineries, helping us quickly distill useful material from the mountains of digital dross that comprises the Internet. Search engines also are intensely competitive products, trying to win your loyalty both with fast-and-furious marketing campaigns and by constantly improving their technology. The hot search engine of last year is not necessarily this year’s best. Search engines also compete with directories, like Yahoo!, which organize sites by subject categories instead of listing them in an order based on how relevant they are to your search terms.

All search engines use the same basic technology: "Spiders" automatically roam the Web looking at the contents of every Web site they have the resources to find, making copies of at least some of the contents. Indexing tools digest this gathered material so that when you conduct a search, relevancy ranking software can prioritize the sites. Although the underlying technology is similar, each search engine implements it differently.

SPIDER TECHNOLOGY
Spider (a.k.a. robot or "bot") software perpetually crawls from site to site, taking a complete measure of the Web. Once it finds a new site or one that has changed, it sends material back to the search engine. While some spiders grab every site they find, others prioritize their efforts by determining a site’s popularity or how frequently it has changed. The assumption here is that popular and frequently changed sites are more likely to be the most useful.

Deciding whether to be selective or all-inclusive is tricky. Say an undiscriminating spider finds a site that hasn’t been updated in two years and sends back 60 pages for indexing. This could clog your search results lists with time-wasting, out-of-date pages. On the other hand, old information isn’t necessarily bad; there might be valuable data hidden away on that old site.
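
One way to picture the selective approach is as a priority queue of sites waiting for a visit. The Python sketch below is purely illustrative: the scoring formula, the field names, and the sample sites are our own assumptions, not the method of any engine reviewed here.

```python
import heapq

def crawl_order(sites):
    """Return site names in the order a selective spider might visit them.

    Each site is a dict with hypothetical fields:
      name              - site identifier
      inbound_links     - a rough stand-in for popularity
      days_since_change - how long ago the site last changed
    """
    heap = []
    for site in sites:
        # Assumed scoring: popular and recently changed sites score higher.
        score = site["inbound_links"] / (1 + site["days_since_change"])
        # heapq is a min-heap, so push the negated score to pop the best first.
        heapq.heappush(heap, (-score, site["name"]))
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order

sites = [
    {"name": "stale-archive", "inbound_links": 40, "days_since_change": 730},
    {"name": "daily-news", "inbound_links": 200, "days_since_change": 0},
    {"name": "hobby-page", "inbound_links": 5, "days_since_change": 10},
]
print(crawl_order(sites))
```

Under this assumed formula the two-year-old site is visited last rather than skipped, which is one possible compromise between the selective and all-inclusive approaches.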

INDEXING
It would be untenable for search engines to examine all Web sites every time you conducted a search. Instead, search engines use indexes, which list the words contained in as many as 100 million Web pages, as well as other information about the sites. Each item listed in the index points to the resource in which it is located. When you run a search, the engine consults its index and displays a list of all sites that match your query—together with links to those sites.
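
The structure described above is commonly called an inverted index. Here is a minimal Python sketch of the idea, assuming a tiny in-memory collection of pages; the URLs and page text are invented for illustration, and real engines store far more than a set of URLs per word.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    # Start with the pages for the first word, then intersect with the rest.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages, for illustration only.
pages = {
    "http://example.com/a": "Bicycle repair and bicycle helmets",
    "http://example.com/b": "Helmet laws for motorcycle riders",
    "http://example.com/c": "Racing bicycle tire repair",
}
index = build_index(pages)
print(sorted(search(index, "bicycle repair")))
```

The search never touches the page text itself; it only looks up words in the index, which is why a query can be answered without revisiting the Web.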

Indexing technology isn’t new, but each search engine has its own implementation, primarily at the spider stage. Some engines, such as AltaVista, Excite, and Infoseek, have their spiders send back every word in a Web site for indexing. Others return only a portion of the text in a Web site (just the first half, for example). As with spidering, indexing more data may either waste your time with nonproductive results or unearth nuggets of information hidden deeply in obscure sites.

RELEVANCE RANKING
Let’s say that our broad search for "bicycle" nets 25,000 Web pages. Obviously, that’s too many sites for you to wade through in order to find the specific information you want. To address this problem, search engines try to determine which sites are most relevant to you. All engines display the most relevant sites at the top of the search results list. They also provide additional cues, such as a numerical ranking.

Again, search engines use different techniques for determining relevancy. One common technique is to count the number of times the search term appears. The logic is that if the word "bicycle" appears repeatedly on a page, that page is more likely to be about bicycles than a page with just one instance of the word.
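
The counting technique is simple enough to sketch in a few lines of Python. The pages below are invented, and the raw count is the only signal used; treat it as an illustration of the idea rather than any engine’s actual formula.

```python
import re

def term_count(text, term):
    """Count whole-word occurrences of term in text, ignoring case."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words.count(term.lower())

def rank_by_count(pages, term):
    """Return (url, count) pairs for pages containing term, highest count first."""
    scored = [(url, term_count(text, term)) for url, text in pages.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical pages, for illustration only.
count_pages = {
    "http://example.com/news": "Local man rides bicycle to work",
    "http://example.com/shop": "Bicycle sales, bicycle repair, bicycle rentals",
    "http://example.com/cars": "Used cars for sale",
}
print(rank_by_count(count_pages, "bicycle"))
```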

Some search engines weigh how often the search term appears in the first n words of the document, the reasoning being that documents are more likely to be relevant if search terms appear higher. If your query includes multiple words, the search engine may consider how closely together those terms appear.
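
Both signals can be measured with simple word positions. The sketch below, in Python, assumes a cutoff of n words chosen by the caller and measures proximity as the smallest gap, in words, between two terms; the function names and the sample sentence are ours.

```python
import re

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def early_hits(text, term, n=10):
    """Count occurrences of term within the first n words of the text."""
    return tokenize(text)[:n].count(term.lower())

def min_distance(text, term_a, term_b):
    """Smallest gap, in words, between any occurrences of the two terms.

    Returns None when either term is missing from the text.
    """
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

sample = "Bicycle helmet reviews and a guide to choosing a racing bicycle"
print(early_hits(sample, "bicycle", n=5))
print(min_distance(sample, "bicycle", "helmet"))
print(min_distance(sample, "racing", "helmet"))
```

How an engine turns these measurements into a final score (how much an early hit or a small gap is worth) is a design choice this sketch leaves open.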

FLESHING OUT FINDS
Besides relevancy rankings, all engines provide at least some textual description of found sites. These are essential for determining whether a site meets your needs. Some sites create the descriptions by using the first n characters of the document, which sometimes results in the display of meaningless information such as menu options. Others, such as Excite and Lycos, use their own technology to extract key words and phrases that describe the site. These techniques employ linguistic analysis to ferret out important words.

Still other sites use the HTML <META> description tag in which the site developer describes the site. Where developers are conscientious, this is useful. If the description is missing or poorly written, the citation in the search results is not useful.
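
The two simpler approaches (the developer’s <META> description, or the first n characters of the page) can be combined into a fallback scheme. The Python sketch below uses the standard library’s html.parser module; the describe function, the fallback order, and the sample pages are our own assumptions, not a description of how any particular engine works.

```python
from html.parser import HTMLParser

class DescriptionParser(HTMLParser):
    """Collect the META description and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.description = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content")

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())

def describe(html, n=40):
    """Return the META description, or else the first n characters of text."""
    parser = DescriptionParser()
    parser.feed(html)
    parser.close()
    if parser.description:
        return parser.description
    return " ".join(parser.text_parts)[:n]

with_meta = (
    '<html><head><meta name="description" content="Bicycle repair tips">'
    "</head><body><p>Home | About | Contact</p></body></html>"
)
without_meta = (
    "<html><body><p>Home | About | Contact</p>"
    "<p>Welcome to our bicycle shop.</p></body></html>"
)
print(describe(with_meta))
print(describe(without_meta, n=22))
```

The second call shows the weakness noted above: with no description tag to draw on, the summary is nothing but the page’s menu options.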

ENGINES OF CHOICE
The search engine business moves quickly. For example, WebCrawler (http://www.WebCrawler.com) was one of the first search engines to appear and we included it in our last roundup. It’s still around, but it’s no longer competitive so we left it out. Moreover, new search engines seemingly appear every week. However, we didn’t include new search sites unless they offered something unique and useful, or unless their capabilities compared well with the established search engines.

The six sites we review are the top dogs, the ones that give the best overall results on a variety of searches. They’re not equal: Some find more information than others, and some offer more non-search services. But each has something valuable to offer.

 
ALTA VISTA     http://www.altavista.digital.com
AltaVista is the monarch of brute-force searches. It recently released a new index that digests 100 million pages, roughly twice as many as its chief competitors. Test results show AltaVista consistently found more Web documents than HotBot, its closest rival.

AltaVista’s interface, which formerly was a bit of a garbled mess, is much improved (though still not quite as easy to use as HotBot’s). The home page now contains only a search form, an ad, and standard menu options. Similarly, its search results screens are far less cluttered and easier to navigate than before. Its relevancy rankings proved reliable in our tests, though its site descriptions often weren’t useful.

AltaVista is trying some innovative approaches for refining searches. The most successful of these attempts occurs after you search, then click the Refine button. A series of words gleaned from found documents appears; you can elect to exclude or "require" each of these associated terms in a subsequent search. Among the refining words for our search on "bicycle" were "racing" and "helmet." Refining the search with those words helped find pages that discussed issues related to racers wearing helmets.
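
The require-or-exclude step amounts to a filter over the pages already found. Here is a minimal Python sketch of that logic; it is our own illustration with invented pages, not AltaVista’s code.

```python
import re

def refine(pages, require=(), exclude=()):
    """Keep pages that contain every required word and no excluded word."""
    kept = []
    for url, text in pages.items():
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        has_required = all(w.lower() in words for w in require)
        has_excluded = any(w.lower() in words for w in exclude)
        if has_required and not has_excluded:
            kept.append(url)
    return kept

# Hypothetical results from a broad search on "bicycle".
found = {
    "http://example.com/race": "Bicycle racing rules require a helmet",
    "http://example.com/tour": "Bicycle touring routes and helmet advice",
    "http://example.com/shop": "Bicycle racing frames for sale",
}
print(refine(found, require=["racing", "helmet"]))
print(refine(found, require=["helmet"], exclude=["racing"]))
```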

There’s also a flow chart that graphically summarizes search results, letting you choose among words associated with the top-level key words found at the site and include these in subsequent, refining searches. Overall, we found this feature less useful as it was somewhat time-consuming and imprecise.

AltaVista has added an advanced search page in which you can frame complex Boolean queries. You can perform proximity searches as well as search for pages changed within a specific timeframe. Another unique feature allows you to tell AltaVista to attach added importance to specific words when creating its relevancy rankings. However, you still must master its abstruse search syntax to create complex searches.

AltaVista’s usability has improved to the point that it’s now on a par with that of most other search engines—although, again, it still doesn’t match HotBot’s friendly forms-based approach. Nonetheless, AltaVista has clawed its way back to the top spot when measuring raw search power. If AltaVista can’t find it, there’s a high likelihood that it’s not on the Web.

 

HOTBOT     http://www.hotbot.com
HotBot is a powerful and easy-to-use site that consistently came in second behind AltaVista in our test searches. Most notably, HotBot claims to reindex its approximately 54 million pages every two weeks! That proved out in our tests: Sites that had been updated in the last 10 to 14 days consistently turned up in our HotBot searches but not in searches of the other engines.

Moreover, HotBot’s advanced querying capabilities are, along with AltaVista’s, the strongest in the group. It will return exact matches to your query or near matches. You can limit your search to specific domains (such as .com or .org) and to specific geographic locations. It can even find items embedded in Web pages like ActiveX controls, Java applets, and specific types of images.

Our favorite HotBot feature is a search modifier that searches only for pages that have changed within a specified period of time, ensuring timely results. You also can search a site to a specific depth (three pages deep, for example). In addition, you can save searches for later reuse.

HotBot’s customizable forms-based interface is, in our view, the easiest to use of any search engine. You type your search term and then select how to modify it from a drop-down list. You can click on icons to display additional forms in which you apply advanced searching techniques such as querying by domain.
 
HotBot’s relevancy ranking is generally good but not the best. In our tests, it sometimes wasn’t clear why certain pages turned up at the top of the list. Its site descriptions generally were useful, but not always. HotBot pulls its descriptions from the <META> description tag or, when that isn’t present, from the top headings. This sometimes results in descriptions that are gibberish.

For now, HotBot claims the silver medal for power searching. Its directories, which are due for some work, proved anemic in our tests. But when it comes to finesse, HotBot—and its easy-to-use forms for framing searches—has no equal.

 
EXCITE     http://www.excite.com
Excite’s 50 million indexed pages match the numbers found at other leading search engines, but its search results were consistently in the bottom half of the group. Nonetheless, Excite still offers a lot. For instance, we appreciated its ability to search for concepts: A query for "martial arts" finds sites about kick-boxing and karate even if the original search term isn’t in the page. After a search is done, Excite suggests words for narrowing the query. A search for sites about the Zen poet Basho suggested we also add the search term "haiku," an astute suggestion. Unfortunately, the feature didn’t always work this well. After conducting a broad search for "bicycle," among the words it suggested adding to refine the search were bicyclists, bicycles, bike, and biking.
 
Power searchers will appreciate the appropriately named Power Search page, a forms-based aid that lets you quickly create Boolean searches. Like Infoseek, Excite also does an excellent job of helping you after you conduct a search. Besides its suggestions for further narrowing the search, if you find a page that appeals to you, clicking on a link automatically finds more pages that are similar to it.

We found Excite’s relevancy rankings to be solidly accurate, although we disliked its site summaries. Excite draws its summaries from text within the page, but not necessarily from introductory text. The descriptions often mirrored the second, third, or fourth sentence on a page, which often weren’t descriptive.

Excite has a directory of 140,000 Yahoo-style listings, which pales compared to Infoseek’s 500,000 directory listings. And it doesn’t offer much brute search power. However, it does make searching—and refining searches—simple, and its excellent relevancy ranking helps you home in on useful information.
 

INFOSEEK     http://www.infoseek.com
Infoseek is both broad and deep. It was a close third to AltaVista and HotBot in overall search power, and it is chock-full of amenities, such as an extensive directory. Infoseek claims approximately 60 million indexed pages, and our tests confirmed that it typically found almost as many pages as HotBot and AltaVista. You can use its natural language query capabilities, but for greater precision, you must learn its somewhat complex advanced search syntax.

If you do take the trouble, you’ll be amply rewarded. You can, for example, search not just for words but also for sites with links to specific URLs. This enables you to perform tricks like searching for all sites with links to your site.

Previously cluttered and confusing, Infoseek’s interface is now well organized and easy to navigate. Infoseek also provides the best set of post-search options. To the left of the search results is a column with three headings. Best Bets lists interesting sites that closely match your search criteria. The Related News heading displays recent news stories about your search term. Related Topics lists similar topics. This was sometimes useful, sometimes not: A search for "cherry torte" listed "desserts" and "chocolate" as related topics, but a search for "meaning of life" found "Renting vs. buying" and "Baseball in Japan" as related topics.

In addition, you can easily refine your searches by specifying new criteria and searching again within the initial results set. For instance, our search for "bicycle" turned up more than 126,000 hits. A subsequent search within that group for "tire repair" found more than 3,000 Web sites and Usenet postings about that subject.

Infoseek doesn’t have quite the search muscle of HotBot or AltaVista, and its descriptions of found sites often were gibberish compared to Northern Light’s and HotBot’s. However, Infoseek’s myriad other services, such as its generous Web directory, access to the news, and its yellow and white pages directories, make it the strongest choice for those who prefer one-stop shopping.
 

LYCOS     http://www.lycos.com
Lycos isn’t the most powerful search engine, but it is the most highly customizable—and it does a solid job of combining Web search and directory services. Its customization capabilities let you fine-tune how its relevancy ranking works. You access this feature from Lycos Pro (the advanced search interface). You can set it, for instance, to rank documents in which search terms appear in the title more highly than other documents. Besides tweaking the relevancy ranking, you can conduct the actual search from Lycos Pro.
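
To see why adjustable weights matter, consider a toy ranking in which hits in the title and hits in the body are counted separately. This Python sketch is our own illustration; the weights, field names, and documents are assumptions and do not reflect how Lycos Pro is implemented.

```python
import re

def weighted_score(doc, term, title_weight=1.0, body_weight=1.0):
    """Score a document by weighted counts of term in its title and body."""
    def hits(text):
        return re.findall(r"[a-z0-9]+", text.lower()).count(term.lower())
    return title_weight * hits(doc["title"]) + body_weight * hits(doc["body"])

def rank(docs, term, title_weight=1.0, body_weight=1.0):
    """Return docs sorted by weighted score, best first."""
    return sorted(
        docs,
        key=lambda d: weighted_score(d, term, title_weight, body_weight),
        reverse=True,
    )

# Hypothetical documents, for illustration only.
docs = [
    {"title": "Bicycle Buyer's Guide", "body": "How to choose a frame and wheels."},
    {"title": "Weekend Notes", "body": "My bicycle, your bicycle, every bicycle."},
]
print([d["title"] for d in rank(docs, "bicycle")])
print([d["title"] for d in rank(docs, "bicycle", title_weight=5.0)])
```

With equal weights the page that merely repeats the word wins; raising the title weight moves the buyer’s guide to the top.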

Lycos also provides an extensive number of ways to create and modify searches, including proximity searches and setting the order in which search terms must appear in the documents. You can perform natural-language queries such as, "What is the meaning of life" (which, happily, turned up 43,668 responses). However, unlike the other very customizable search engine reviewed, HotBot, Lycos doesn’t let you search based on when pages were last modified.

We found Lycos’ relevancy ranking consistently to be the strongest in the group. However, its method of extracting descriptions from text in the documents often failed to tell us clearly what the site contained.

Despite its flexibility in modifying searches and results, Lycos didn’t find as many documents as most of the other search engines. Lycos boasts it has indexed 100 million separate URLs and that it rebuilds its index every two weeks. URLs aren’t pages, however, and Lycos failed to match HotBot’s ability to find recently updated information.

Nonetheless, we like Lycos because of its useful mix of features. Its staff of reviewers has specified a number of Web avenues as "Top 5 percent sites." Not only can you read and enjoy the reviews themselves, you can run searches strictly within the reviewed sites if you wish. Lycos also has an extensive site directory, as well as news, weather, sports, and the ability to find businesses and people. Power searchers will probably prefer other services, but Lycos still is attractive if you like many Web services in one place.

 
NORTHERN LIGHT     http://www.northernlight.com
Northern Light is the newest kid on the block and should quickly gain popularity. Its search prowess already is strong and its unique features make it more useful for serious researchers than the other search sites. Northern Light had been available to the public for only a month when we tested it, yet its results were solidly in the middle of the pack, with particularly strong results on searches for obscure sites. A spokesman said search results would improve over the coming months as the number of indexed pages climbs from 38 to 50 million. The quality of its relevancy ranking and site descriptions was excellent.

Besides serving as a Web search engine, Northern Light also aggregates and serves the content of about 1,800 publications, many of which aren’t available elsewhere on the Web. You can search the entire Web, its special collection of publications, or both at once. For items found in the special collection, Northern Light provides a free article synopsis. Viewing an entire article costs as much as four dollars, with a typical charge being a dollar.

Northern Light’s unique custom search folders are an often-successful attempt to categorize found documents. Northern Light places found documents into categories and subcategories that it displays as an outline. The feature worked best when the search terms were ambiguous. For instance, we performed a search for "web spider" and Northern Light returned articles about both arachnids and the technology used by Internet search engines.

Custom search folders made it obvious which were which and thus saved a lot of time. This system isn’t infallible though: One of the initial subgroups to our search for bicycle was "rhythm & blues." Still, Northern Light is onto something significant. We found many of the extra-cost publications to be specialized enough to be truly useful. For example, we found excellent articles about forest management that weren’t available elsewhere on the Web. Combined with its already strong search engine, this makes Northern Light an excellent choice for serious information miners.