
11/96 Features: The Search Is On

Finding the right tools and using them properly can shed light on your Web search efforts.

By Cynthia Morgan, Senior Editor, Reviews

A WORLD WIDE WEB search tool can be a great information-gathering implement or a good way to waste a lot of time. Plot your search properly and a query can deliver the precise information you need. Plunge in without thinking, and you could find yourself in the dark, stumbling through a maze of worthless Web sites. Web search engines use intelligent software agents, known as spiders, to gather Internet documents into giant master indexes of sites. These spiders visit URLs submitted by Webmasters or discovered in already-collected documents, and return page contents for indexing.
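The spider behavior described above can be sketched in a few lines of Python. This is only an illustration, not how any particular engine works: the `crawl` function, the `LinkCollector` class and the in-memory `FAKE_WEB` dictionary (which stands in for real network fetches) are all hypothetical names invented for the example.

```python
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch):
    """Visit submitted URLs first, then URLs discovered in collected pages."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    collected = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)          # fetch() returns None for unknown pages
        if html is None:
            continue
        collected[url] = html      # page contents go back for indexing
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:   # queue each newly discovered URL once
                seen.add(link)
                queue.append(link)
    return collected


# Hypothetical three-page "Web" so the sketch runs without a network.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a>',
    "http://b.example/": '<a href="http://a.example/">A</a> '
                         '<a href="http://c.example/">C</a>',
    "http://c.example/": "<p>No links here.</p>",
}

pages = crawl(["http://a.example/"], FAKE_WEB.get)
print(sorted(pages))
```

Only one URL is submitted, yet all three pages are collected, because the other two are discovered through links.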

Online search engines and their agents are faster, friendlier and smarter (and far more sophisticated, too) than the casual, sometimes sloppy student projects from which they arose. Some sites offer 200 or more distinct utilities, such as telephone and address finders, reference guides and news services. Major engines such as AltaVista, HotBot and Infoseek are built around high-speed fiber communications and scalable server architectures. As the Web grows, search tool administrators can simply scale up the server with additional processors, memory and disk space to handle the increased load. Couple that with new high-performance software, and the speed and efficiency of your search are in your hands.

The Web's explosive growth does have a downside, though. Estimates at the beginning of this year put the Web at anywhere from 19 million to 50 million separate URLs, or Web pages. Analysts suggest that the Web is now doubling in size every 100 to 125 days; at that rate, we could see well over 150 million Web pages by the end of the year. Even the most powerful search engine will have a difficult time keeping up, and it's likely a single search engine won't contain all, or possibly even most, of the Web documents available to answer a user's query. So, to perform a comprehensive, fruitful search of Internet resources, you'll need to understand search engine technology well enough to choose the appropriate tools for a particular topic and build highly refined, effective search queries.

A search engine's goal is to maintain an up-to-date index, so its spiders must continually revisit a site to find and re-index changes. Spiders will revisit sites anywhere from every few hours to every few months, depending on the popularity of the site or the number of times a page changes within a given period. Many search engines judge a site's popularity by counting the number of external sites linked to it.
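The growth estimate quoted earlier (19 million to 50 million pages, doubling every 100 to 125 days) is easy to check with a little arithmetic. The sketch below assumes steady exponential growth over a full 365-day year; the function name `projected_pages` is made up for illustration.

```python
def projected_pages(start_pages, doubling_days, elapsed_days=365):
    """Project a page count, assuming it doubles every `doubling_days` days."""
    return start_pages * 2 ** (elapsed_days / doubling_days)


# Most conservative case: smallest starting estimate, slowest doubling.
low = projected_pages(19_000_000, 125)
# Most aggressive case: largest starting estimate, fastest doubling.
high = projected_pages(50_000_000, 100)

print(f"{low / 1e6:.0f} million to {high / 1e6:.0f} million pages")
```

Under these assumptions the year-end range runs from roughly 140 million to more than 600 million pages, so even the conservative case implies a Web several times larger than it was in January.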

Search engines' spiders don't confine their travels to the World Wide Web. Lycos and AltaVista, for example, also check Usenet news archives, ftp sites and gopherspace. Galaxy searches Hytelnet, a catalog of telnet sites. But all engines come in four basic flavors: keyword indexes, subject directories, specialty indexes and meta search tools. Many major search engines, such as Yahoo, Lycos and Excite, are combining these flavors by acquiring additional engines. Yahoo, for instance, now uses AltaVista to provide keyword searches. And Excite offers fully reviewed sites in its subject directory, as well as a range of specialty and keyword indexes.

Keyword indexes, such as WebCrawler, AltaVista, Open Text and Infoseek, collect and index text they find at a Web site. These engines typically build a master index of millions of Web documents; AltaVista, for example, boasts 56 million URLs in its 40GB index. Keyword engines search the index for matching terms, returning document URLs and often the first few lines or a short description of the page. Most keyword-type search engines are extremely fast at responding to queries, mainly because they perform little or no content analysis. Document lists, or result sets, simply indicate whether one or more of the search terms appears within a Web page.

To overcome this shortcoming, keyword indexers assign relevance scores to the pages in the result set. In general, the highest scores go to pages that include the search terms in the title, page heading or HTML-code meta tag. Keyword indexes favor multiple occurrences as close to the top of the document as possible.
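A relevance score of the kind described above might look something like the following sketch. The weights are assumptions chosen purely for illustration (a flat bonus for a title hit, plus credit for each occurrence that shrinks the further down the page it appears); real engines do not publish their formulas, and `relevance` and `tokenize` are hypothetical names.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(query, title, body):
    """Score a page: title hits count most, early body hits count more."""
    terms = set(tokenize(query))
    title_words = tokenize(title)
    body_words = tokenize(body)
    score = 0.0
    for term in terms:
        if term in title_words:
            score += 5.0                          # assumed title bonus
        for position, word in enumerate(body_words):
            if word == term:
                score += 1.0 / (1 + position / 10)  # assumed position decay
    return score


coin_page = relevance(
    "Lincoln wheat penny",
    "Lincoln Wheat Penny Guide",
    "Lincoln wheat penny values by year and mint mark.",
)
bread_page = relevance(
    "Lincoln wheat penny",
    "Baking at Home",
    "A long introduction about ovens, and then whole wheat bread.",
)
print(coin_page > bread_page)
```

Both pages would appear in a plain keyword result set, since each contains at least one search term; the score is what pushes the coin page to the top.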

Subject directories, on the other hand, prescreen Web documents for content, assigning them to specific topical categories you can browse. Yahoo, one of the more popular subject directories, displays Web sites in outline form, beginning with broad topics such as Computers or Government and drilling down to narrowly focused sites. Yahoo's principal directory doesn't rate finds for content quality, but tools within its site do. Other subject directories, such as Excite and NetGuide Live, employ teams of reviewers to evaluate a site for overall quality of presentation and information, and ease of navigation.

Although subject directories can't match the breadth of a keyword index, they can reduce the likelihood of turning up irrelevant sites. They also provide warnings for sites that are browser brand exclusive, heavy on graphics or for adults only. They're not perfect, though; they force you to rely on the reviewer's judgment of a site, and may pigeonhole a broad range of sites into sometimes inappropriate slots. Many directories cater to a more general audience, so they may overlook data on obscure or exotic topics. And because they're geared toward browsing, not searching, subject directories may offer fewer advanced search tools.

Specialty indexes confine search objects to a particular topic or type of information. City.Net, for instance, is a collection of urban information depots with everything from restaurant information to the skinny on upcoming events. Deja News searches the Usenet news archives, OKRA confines its searches to e-mail addresses, and MovieLink focuses on films, returning local theater information by zip code. These can be great tools if you find one specific to your topic. They may use less-advanced technology and may be slower than the leading general engines, but what you give up in performance you get back in a high degree of relevance.

Meta search tools, or meta indexes, go the single-engine search route one better by sending your query to multiple search services simultaneously, combining the results into a single, relevance-ranked list. Highway 61, for instance, searches Yahoo, AltaVista, Lycos, Infoseek, Excite and HotBot. Savvy, another meta search utility, checks no fewer than 28 tools, from Lycos and AltaVista to the Internet Movie Database and LookUP!, an Internet address-finder. Meta-searching is a great time-saver because you don't have to visit and query each site separately, and you can sweep the Internet for documents you may have missed with a single-engine search.

Yet meta indexes can be extremely slow because they have to wait for results from multiple servers. An unusually overloaded or out-of-commission server can hold up an entire search. There are ways to "stream" searches so that faster servers' results display before slower returns arrive, but many engines don't use them.
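The "streaming" idea can be illustrated with Python's standard `concurrent.futures` module: send the query to every engine at once and hand back each result set as soon as it arrives. The two engines here are stand-ins that simply sleep to simulate fast and slow servers; `make_engine`, `stream_search` and the engine names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_engine(name, delay):
    """Build a fake search engine that answers after `delay` seconds."""
    def search(query):
        time.sleep(delay)                      # simulated server latency
        return [f"{name} hit for '{query}'"]
    return search


ENGINES = {
    "fast": make_engine("fast", 0.01),
    "slow": make_engine("slow", 0.3),
}


def stream_search(query, engines):
    """Yield (engine name, results) pairs in order of arrival."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {pool.submit(fn, query): name for name, fn in engines.items()}
        for future in as_completed(futures):   # fastest responses come first
            yield futures[future], future.result()


arrival_order = []
for name, results in stream_search("lincoln cent", ENGINES):
    arrival_order.append(name)
    print(name, results)
```

A non-streaming tool would instead wait for every engine before displaying anything, so the slowest server sets the pace for the whole search.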

Meta tools also keep individual engines' help and tutorial files out of reach. These documents often provide valuable assistance in building precise queries. But you must visit the actual search site to use them. Worse, most meta search engines stick to member engines' default query styles, which keep you from using any advanced options. You may be better off with a new class of search tools called personal spiders (see Not-So-Secret Agents).

Systematic Searching

Even with all these options at their disposal, most users still go no further than the simple default search, and wind up filtering tens or hundreds of documents manually. AltaVista, Lycos and Magellan default to "OR" searches, which simply return any document that includes one or more of the search terms. On the day I queried these services for information on mint-condition Lincoln wheat pennies by keying in mint condition Lincoln wheat penny, I got back articles on candy, bread, New York's Lincoln Tunnel, herbal tea and air conditioners, without a single hit on coins in the first 25 documents. I refined the search with advanced search tools, as shown in the following examples, and dramatically increased the precision score (the percentage of relevant documents in a result set) so that only one of 25 documents contained no subject matter on pennies.
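The precision score mentioned above is simple to compute. As a small sketch (the function name `precision` is just a label for this example), here are the two searches from the anecdote: no relevant hits in the first 25 documents, then 24 relevant hits out of 25.

```python
def precision(relevant, retrieved):
    """Fraction of the retrieved documents that are actually relevant."""
    if retrieved == 0:
        return 0.0
    return relevant / retrieved


default_search = precision(0, 25)    # first 25 hits, none about coins
refined_search = precision(24, 25)   # only one of 25 hits was off-topic

print(f"default: {default_search:.0%}, refined: {refined_search:.0%}")
```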

Building such precision queries isn't difficult. First, you'll need to understand the conventions each engine uses. You'll find this information in a search site's help files. Advanced query tools generally use the following conventions:

Boolean searching. Boolean search terms include OR, AND and NOT (all uppercase). Adding OR to a query, as in term1 OR term2 OR term3, broadens the search to include any document with any of these terms, regardless of their relevance to the topic. To satisfy an AND search (term1 AND term2 AND term3), a document must contain every term in the query. NOT excludes terms that make the document irrelevant to your search, as in term1 NOT term2.

Let's say you're looking for information on a Lincoln wheat penny, but more particularly need data on relatively rare 1909 one-cent pieces minted in San Francisco (1909S), not Denver (1909D). You might frame your search as penny AND 1909S NOT 1909D.

Some engines permit nested queries, so you can structure the order in which they will search. Lincoln AND (cent OR penny), for instance, can retrieve relevant documents that refer to pennies as cents, or vice versa.
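Boolean operators map neatly onto set operations over an index, which is one way to picture what the engine is doing. In this sketch AND becomes set intersection, OR becomes union and NOT becomes difference; the four sample documents and the helper names `build_index` and `docs_with` are invented for illustration.

```python
DOCS = {
    1: "Lincoln wheat penny 1909S San Francisco mint",
    2: "Lincoln wheat penny 1909D compared with 1909S",
    3: "Lincoln cent collecting basics",
    4: "Lincoln Tunnel traffic report",
}


def build_index(docs):
    """Map each lowercased word to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


INDEX = build_index(DOCS)


def docs_with(term):
    """Return the IDs of documents containing the term (case-insensitive)."""
    return INDEX.get(term.lower(), set())


# penny AND 1909S NOT 1909D
rare = (docs_with("penny") & docs_with("1909S")) - docs_with("1909D")

# Lincoln AND (cent OR penny)
nested = docs_with("Lincoln") & (docs_with("cent") | docs_with("penny"))

print(sorted(rare), sorted(nested))
```

The parentheses in the nested query matter for the same reason they do in the Python expression: they decide which operation is applied first.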

AltaVista, Magellan and Infoseek use a plus sign in front of a word, without spaces, to indicate a term that must be included in a hit. A minus sign in the same position indicates a term to be excluded.
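A rough sketch of how plus and minus prefixes could be interpreted follows. It assumes that unmarked terms are optional and that a document with no required terms must contain at least one optional term; actual engines may treat unmarked terms differently, and `parse_plus_minus` and `matches` are hypothetical names.

```python
def parse_plus_minus(query):
    """Split a query into required (+), excluded (-) and optional terms."""
    required, excluded, optional = [], [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:].lower())
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())
        else:
            optional.append(token.lower())
    return required, excluded, optional


def matches(text, query):
    """Check one document against a plus/minus query."""
    words = set(text.lower().split())
    required, excluded, optional = parse_plus_minus(query)
    if any(term not in words for term in required):
        return False               # a required term is missing
    if any(term in words for term in excluded):
        return False               # an excluded term is present
    # Assumption: with no required terms, at least one optional must match.
    return bool(required) or any(term in words for term in optional)


query = "+penny -tunnel lincoln"
print(matches("Lincoln wheat penny guide", query))
print(matches("Lincoln Tunnel penny toll", query))
```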

Lycos gets around its lack of Boolean operators by permitting you to specify the number of search terms a document must contain to be counted as a hit. But you can't specify which of the terms must be included.
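The "match at least N terms" approach can be sketched as follows. This illustrates the idea only and is not Lycos's actual implementation; `matches_at_least` is a hypothetical name.

```python
def matches_at_least(text, terms, minimum):
    """True if the document contains at least `minimum` of the terms."""
    words = set(text.lower().split())
    found = sum(1 for term in terms if term.lower() in words)
    return found >= minimum


terms = ["mint", "condition", "Lincoln", "wheat", "penny"]

print(matches_at_least("Lincoln wheat penny in mint condition", terms, 4))
print(matches_at_least("Whole wheat bread with mint tea", terms, 4))
```

Note the limitation the article points out: the second document would pass with a minimum of two, because nothing in the query says which terms matter.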

Proximity. Many engines let you specify the distance one search term can be from another within a document, so you can weed out long files that may use single words in different contexts. NEAR requires a second term to be within a certain number of words; FOLLOWED BY specifies documents in which search terms occur in a certain order. ADJ finds terms adjacent to one another.
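The three proximity operators can be sketched with word positions. The default distance of 10 words for NEAR is an assumption (engines vary), as is treating ADJ as order-independent; the function names are made up for this example.

```python
def positions(words, term):
    """Return every index at which the term occurs in the word list."""
    return [i for i, word in enumerate(words) if word == term]


def near(text, a, b, max_distance=10):
    """NEAR: some occurrence of a lies within max_distance words of b."""
    words = text.lower().split()
    return any(
        abs(i - j) <= max_distance
        for i in positions(words, a.lower())
        for j in positions(words, b.lower())
    )


def followed_by(text, a, b):
    """FOLLOWED BY: a occurs somewhere before b."""
    words = text.lower().split()
    return any(
        i < j
        for i in positions(words, a.lower())
        for j in positions(words, b.lower())
    )


def adjacent(text, a, b):
    """ADJ: a and b sit next to each other (either order, by assumption)."""
    return near(text, a, b, max_distance=1)


text = "the wheat penny was minted when Lincoln was honored"
print(near(text, "Lincoln", "penny", max_distance=5))
print(followed_by(text, "wheat", "penny"))
print(adjacent(text, "Lincoln", "penny"))
```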

Query by Example (QBE). More and more search engines let you highlight a particularly useful document in a result set and click on a "More like this," "Find similar" or QBE button that automatically builds a search from all keywords in that document. QBE searches are great time-savers because they may find search terms that didn't occur to you. They're a great way to zero in on highly relevant Web sites, and will likely become an accepted standard for general search engines. WebCrawler and Excite already offer these searches.

Phrase or string searching. Placing quotation marks around a phrase lets you search "Lincoln wheat cent" as a single phrase rather than individual words. But it must find an exact match; it won't count "Lincoln cent (wheat)" as a hit.
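The exact-match behavior of a phrase search can be sketched as a comparison of consecutive words. This version assumes punctuation and case are ignored, which may not hold for every engine; `contains_phrase` is a hypothetical name.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(text, phrase):
    """True if the phrase's words appear consecutively and in order."""
    words = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    return n > 0 and any(
        words[i:i + n] == target for i in range(len(words) - n + 1)
    )


print(contains_phrase("A rare Lincoln wheat cent from 1909", "Lincoln wheat cent"))
print(contains_phrase("Lincoln cent (wheat)", "Lincoln wheat cent"))
```

The second document contains all three words but in a different order, so it is not counted as a hit.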

Field-based searching. Some engines, such as AltaVista, Open Text and Infoseek Ultra, let you drill down to the actual HTML field for high-precision searching. For instance, you can search for every page with the word "padparascha" in the title by typing title:"padparascha" into the query form.
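To see why a field-based search is more precise, consider this sketch, which checks only the contents of a page's HTML title element and ignores the body. It uses Python's standard `html.parser` module; `TitleParser` and `title_contains` are hypothetical names, and the sketch is not meant to mirror any engine's internals.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Captures the text inside the <title> element of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def title_contains(html, word):
    """True only if the word appears in the page title, not just the body."""
    parser = TitleParser()
    parser.feed(html)
    return word.lower() in parser.title.lower().split()


gem_page = ("<html><head><title>Padparascha Sapphires</title></head>"
            "<body>Gem prices.</body></html>")
other_page = ("<html><head><title>Gem News</title></head>"
              "<body>One padparascha was sold.</body></html>")

print(title_contains(gem_page, "padparascha"))
print(title_contains(other_page, "padparascha"))
```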

Searching backward. Still relatively rare, this search lets you monitor the popularity of a URL on your Web server by using field-based searching. You can give Infoseek Ultra and AltaVista the query, and they'll return information on external sites that link to WINDOWS Magazine's Web site. This is valuable information because it helps you gauge a site's popularity.

Word variants. Some engines search for plural and singular forms of a term. Most are not case-sensitive, finding occurrences of "NEXT," "Next" and "next" when they search for NeXT Computer, but there are a few that support case-sensitive searches. Many, including Lycos, search for extensions to the root word, so that a search for "blend" may also turn up "blender," "blending" and "blended."
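Case folding and root-word extension can both be sketched in one small matching function. Treating "extension" as a simple prefix match is an assumption for illustration; real engines may use more careful stemming, and `variant_match` is a hypothetical name.

```python
def variant_match(word, term, case_sensitive=False, extend_root=True):
    """Compare a document word against a search term."""
    if not case_sensitive:
        word, term = word.lower(), term.lower()
    if extend_root:
        return word.startswith(term)   # "blend" also matches "blender"
    return word == term


words = "Blender blending blended bland NeXT next".split()

# Default behavior: ignore case and accept extensions of the root word.
root_hits = [w for w in words if variant_match(w, "blend")]

# Case-sensitive, exact-word search.
exact_hits = [
    w for w in words
    if variant_match(w, "NeXT", case_sensitive=True, extend_root=False)
]

print(root_hits, exact_hits)
```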

Tips for Better Searching

Once you've learned the basics of advanced searching, a few additional tips can greatly enhance the effectiveness of your queries.

The Foremost Finders

You'll need more than one Web search tool in your research arsenal. Here's a quick rundown of the most popular engines:

Good search tools such as these can be a Web surfer's most powerful resource. Learn to use them properly, and you won't find yourself searching in the dark.