[ Go to July 1997 Table of Contents ]
-- by Tom Henderson
It's no longer enough just to have a Web site listing your company's latest products and sales information. Web surfers and intranet users want to enter your site, extract information and head to the next wave. As a Webmaster, your challenge is to take diverse content, digest it into a database and present a mechanism (a search option) that lets visitors rapidly find the information they want.
Until recently, presenting content to Web visitors was a frustrating experience for many Windows NT shops that lacked seasoned Webmasters. Each time a new Web page was added to a Web site, it could take hours, days or months for the new content to be indexed into a database that visitors could query against. Early search engines also had (and largely still have) less-than-stellar search techniques. Simple keyword searches only uncovered selected portions of the data Web visitors wanted. As hypertext viewing goes, it was primitive.
New and evolving search engine technologies (from the likes of Microsoft, Oracle and Verity), coupled with a growing number of supported data formats (from HTML to Microsoft Word documents), are making content administration and index updates easier than ever. There are server-specific engines and cross-platform search engines. Many of these are younger cousins of now-famous Web search engines such as Yahoo.
Search engine overview
A search engine's job is to connect Web site visitors to the information they want. Queries against the search engine need to be handled quickly, so your Web site can move on to the next transaction. The transactions have several stages. From a surfer's perspective, there's a data-entry box on an HTML page that lets the visitor type in keywords and phrases. Some sites also include modifiers that affect the scope of the query and requests for formatting the query results.
From the server side, the data is sent through an applications programming interface (API) to a search or index engine. The engine software considers the query (and any additional system defaults, heuristics and, of course, the database) and returns various results. The results can include a "sorry" screen (that is, the search turned up no results), or Web page data formatted by title or score (which indicates the likelihood that a particular Web page will include content related to the search). The returned data is then sent to the user as a formatted HTML screen filled with "hits." Each hit represents a document or another URL that the visitor can further drill down on.
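The round trip described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's engine: the PAGES table, the search and render functions, and the scoring rule (a simple count of keyword occurrences) are all assumptions made for the example.

```python
from html import escape

# Hypothetical in-memory index: URL -> (title, page text).
PAGES = {
    "/products.html": ("Products", "desktop computer and notebook computer price list"),
    "/support.html": ("Support", "computer repair and printer support"),
    "/jobs.html": ("Careers", "open positions in sales and marketing"),
}

def search(query, pages=PAGES):
    """Return (score, url, title) hits, best first.

    The score here is just the number of keyword occurrences on the page.
    """
    terms = query.lower().split()
    hits = []
    for url, (title, text) in pages.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            hits.append((score, url, title))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits

def render(query, hits):
    """Format hits as an HTML list, or a "sorry" screen when nothing matched."""
    if not hits:
        return "<p>Sorry, no documents matched %s.</p>" % escape(query)
    items = "".join(
        '<li><a href="%s">%s</a> (score %d)</li>' % (escape(url), escape(title), score)
        for score, url, title in hits
    )
    return "<ol>%s</ol>" % items
```

Calling render("computer", search("computer")) would list the Products page ahead of the Support page, because the keyword appears twice on the first and once on the second.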
At each step, site visitors make choices, unless they're confined by the HTML search form. As an example, a query on the word "computer" can return thousands of hits from an engine like AltaVista. Rather than queue those documents, AltaVista will limit the results to under 500 titles ranked according to default rules or those rules imposed by the visitor through query choices.
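A cap on ranked results like the one just described might look like the following Python sketch. The limit of 500 and the (score, URL) pair layout are assumptions for illustration; they are not AltaVista's actual ranking rules.

```python
import heapq

def top_hits(scored_hits, limit=500):
    """Keep only the `limit` highest-scoring (score, url) pairs, best first."""
    return heapq.nlargest(limit, scored_hits, key=lambda hit: hit[0])
```

Because only the best few hundred hits are kept, the server never has to sort or format the thousands of lower-ranked documents.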
At least three prevalent APIs support Web server queries and transactions: Common Gateway Interface (CGI), Microsoft's Internet Server API (ISAPI) and Netscape Server API (NSAPI). The three APIs are transparent to end users but are markedly different in their approach to handling Web searches.
CGI has its roots in UNIX. It brokers or "accepts" each search query, called a request, and spawns an independent process for each CGI request made. As a result, each concurrent query or request demands additional memory and CPU cycles on your Web server.
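To make the CGI model concrete, here is a minimal sketch of a CGI search script in Python. The server passes the query in the QUERY_STRING environment variable and reads the response from the script's standard output. The form field name "q" and the handle_request function are assumptions for the example.

```python
import os
import sys
from html import escape
from urllib.parse import parse_qs

def handle_request(environ):
    """Build a CGI response from the environment the Web server sets up."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    keywords = params.get("q", [""])[0]  # "q" is a hypothetical form field name
    body = "<p>You searched for: %s</p>" % escape(keywords)
    # A CGI response is headers, a blank line, then the document.
    return "Content-Type: text/html\r\n\r\n" + body

if __name__ == "__main__":
    # The server launches this script anew for every incoming request.
    sys.stdout.write(handle_request(os.environ))
```

Every query pays the cost of starting the script from scratch, which is why CGI tends to strain a busy server.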
Microsoft's ISAPI is different. It uses a "common queuing area" to handle concurrent queries, instead of spawning an independent query instance for each user who's searching your Web site. This makes for more efficient use of your Web server's horsepower. Netscape's NSAPI also uses one instance to service multiple HTML flows, but these flows are divided into several subcomponents. As a result, some Webmasters consider NSAPI more difficult to use than ISAPI.
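The shared-queue idea can be illustrated with a small Python analogy. This is not the ISAPI programming interface itself, only a sketch of the general pattern: a fixed pool of workers inside one long-running program pulls requests from a single queue. The serve function, the worker count and the placeholder "results for" answer are all hypothetical.

```python
import queue
import threading

def serve(requests, worker_count=4):
    """Answer every request using a fixed pool of threads fed by one shared queue."""
    pending = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more work for this worker
                break
            request_id, keywords = item
            answer = "results for " + keywords  # stand-in for the real search
            with lock:
                results[request_id] = answer

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for item in requests:
        pending.put(item)
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

However many visitors arrive at once, the number of workers stays fixed, so the load on the server grows far more gently than it does when every request starts its own program.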
The platform for your Web site's index engine will typically require one of these APIs or a proprietary API that's specific to the engine. Some index engines, however, are compatible with all three of the aforementioned APIs.
Of course, visitors can't effectively search your Web site unless its content is indexed. Web site data is added to your index database either automatically (thanks to a process called spidering) or manually, through ad hoc indexing or scheduled reindexing. Spidering is a process by which a search engine examines a defined area, usually a directory structure. The qualified (administrator-defined) content is digested into the index or search database, and the local Web content becomes available for searches.
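A bare-bones spider over a directory structure could be sketched as follows in Python. The list of qualifying file extensions, the crude tag stripping and the spider function itself are assumptions for illustration; commercial engines apply far richer filters.

```python
import os
import re
from collections import defaultdict

def spider(root, extensions=(".html", ".htm", ".txt")):
    """Walk `root` and return an inverted index: word -> set of relative file paths."""
    index = defaultdict(set)
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(extensions):
                continue  # not content the administrator has qualified
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                text = handle.read()
            text = re.sub(r"<[^>]+>", " ", text)  # crude HTML tag stripping
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[word].add(os.path.relpath(path, root))
    return index
```

Looking up a keyword in the returned index yields the set of files that contain it, which is the raw material for a hit list.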
Some index engines, like Oracle's ConText Option 2.0 for Oracle Universal Server, use a relational database as the information store for text. Once text data (ConText supports ASCII, HTML, Microsoft Word 6, and WordPerfect 5.x and 6.x) is stored in the Oracle database (as field updates using different types of fields), queries can be made against the data through traditional SQL query tools, including ODBC (Open Database Connectivity) and Visual Basic Links, Sybase PowerBuilder and Oracle's own form tools. Oracle also adds automated indexes with search capabilities not normally found in relational databases.
Traditional SQL is relatively standardized but isn't robust. That's because SQL is a "least common denominator" among client/server products and lacks extensions necessary to invoke features specific to the database product or its host programming language, such as Oracle's PL/SQL.
To compensate for SQL's shortcomings, Oracle's ConText adds synonyms, Soundex (a standard "sounds-like" algorithm), fuzzy match and proximity searches to the regular SQL language extraction format. You can construct traditional SQL queries to find rows (database records) using syntax that's familiar to database users (select empno from emp where empname = 'fred jones';), but the ConText extensions add capabilities to find "phred" (a soundalike spelling) or "fred jones near amy rice."
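For readers curious about Soundex, here is a sketch of the classic American Soundex coding in Python. It shows the general algorithm only and makes no claim about how ConText implements it. Note that classic Soundex always keeps the first letter of a name, so "fred" and "phred" receive different codes under it; matching that pair would depend on the engine's own variant or on its fuzzy matching.

```python
# Letter groups that share a Soundex digit; vowels, h, w and y have no digit.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex(name):
    """Return the four-character classic Soundex code for a name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":  # h and w do not break a run of equal codes
            previous = digit
    return (code + "000")[:4]  # pad with zeros, then cut to four characters
```

With this coding, "Robert" and "Rupert" both become R163, while "fred" becomes F630 and "phred" becomes P630.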
ConText's strength is its integration with a database management system (DBMS), which offers structure and security. Data added into such a system would have a natural or predictable storage and link-point. Also, Oracle supports both UNIX and Windows NT (the two most prevalent Web server platforms), and hundreds of query tools.
Other products, such as Verity's Search'97 Information Server, use a different approach. Search'97 is an engine that can connect to several different Web server products and their corresponding API foundations. For example, Search'97 will be bundled shortly with Netscape's SuiteSpot Web server family, which uses NSAPI. It's also possible to use Search'97 with CGI or ISAPI, and on Windows NT 3.51 and 4.0. The core Search'97 engine stays the same and hooks to whichever API the Webmaster prefers.
Search'97's multi-API support has benefits as well as potential drawbacks. On the upside, you can deploy your Web site and its search engine on a low-end NT server, and later move, if necessary, to a midtier symmetric multiprocessing (SMP) server running NT or to a high-end Sun Solaris server using NSAPI. Search'97 also formats query results that are returned to Web site visitors. As with Oracle's ConText, the Search'97 database engine doesn't need to reside on the Web server where the query forms are located, and Search'97 supports multiple instances of itself.
On the downside, Search'97 supports Microsoft Internet Information Server (IIS) 2.0 but not IIS 3.0. Considering that Verity is bundling Search'97 with Netscape's SuiteSpot, the company's focus on IIS might be further limited.
Still, Search'97 is tightly integrated with at least one Microsoft application. It can index Microsoft Exchange Server public folders and outperforms Exchange's Find mechanism, which is activated by pressing Ctrl+Shift+F5. Another Verity offering, Search'97 Personal, brings powerful search capabilities to your own hard drive.
Windows 95, NT and UNIX users can organize and index their mail, attachments, folders and documents. A built-in viewer lets you see documents in their native formats (for example, ASCII, HTML, Adobe Acrobat's Portable Document Format [PDF] and word processing documents) with highlighted search terms.
Back on the server, the Search'97 engine supports several data import file formats missing from Oracle's ConText, including the popular PDF. Verity uses Inso Corp.'s QuickView (formerly called Mastersoft Filter and Viewer Kit) to support a whopping 200 different file formats.
During installation of Search'97 on NT, an Installation Wizard checks the infrastructure, then allows Webmasters to set resource allocations, and define search defaults (by title, for instance) and the minimum score needed for documents to be displayed. By requiring higher scores, you'll generally ensure that queries return fewer documents, and therefore reduce the server resources demanded by concurrent queries.
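The effect of a minimum score is easy to see in a short Python sketch. The function name, the sample hits and the 0-to-1 score scale are assumptions for illustration, not Search'97's actual settings.

```python
def apply_minimum_score(hits, minimum_score):
    """Drop hits scoring below the threshold; hits are (score, title) pairs."""
    return [hit for hit in hits if hit[0] >= minimum_score]

# Hypothetical hits with relevance scores between 0 and 1.
hits = [(0.92, "Price list"), (0.55, "Press release"), (0.18, "Site map")]
```

Raising the threshold from 0.5 to 0.9 cuts this sample result list from two documents to one.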
Microsoft Index Server
Microsoft now offers its own query software, Index Server, for IIS. Though Index Server ships with IIS, Microsoft also posts upgrades to the index software separately on the Internet. The latest release, Index Server 1.2, runs only on NT but has at least one huge advantage over its competition: It's free.
Index Server automatically digests Microsoft Word, Excel and PowerPoint documents, as well as text and HTML pages. It scans NTFS and FAT partitions for changed content and updates a data store (a database file in Index Server) consisting of normalized keyword links to the data files. While Index Server can also include non-NT resident data (such as data residing on a networked NetWare server), it requires periodic scheduled updates to ingest that foreign data.
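One simple way to spot changed content is to compare each file's modification time against the value recorded on the previous scan, as in the Python sketch below. This polling approach is an assumption chosen for illustration and is not a description of how Index Server itself detects changes; the changed_files function and the last_seen table are hypothetical.

```python
import os

def changed_files(root, last_seen):
    """Return paths under `root` that are new or modified since the last scan.

    `last_seen` maps path -> modification time from the previous scan
    and is updated in place.
    """
    changed = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            mtime = os.stat(path).st_mtime
            if last_seen.get(path) != mtime:
                changed.append(path)
                last_seen[path] = mtime
    return changed
```

Only the files this function returns need to be reindexed, which keeps a scheduled update from rereading the entire site.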
Changing Index Server's default settings is a complicated task that requires adjusting Registry settings. Fortunately, Microsoft includes a wide variety of sample HTML query forms for you to use. Index Server doesn't consume much disk space; it requires only about a 10 percent premium over your indexed data. For instance, if you index 100MB of data, Index Server will likely consume just another 10MB. You can limit the type of data to be indexed by restricting the directories to be indexed and the files placed in those directories. For instance, you could limit file support to HTML data. Also, performance monitoring hooks are included so you can observe the activities and queue backlogs/fulfillment, an item missing in Search'97. But like Search'97 and ConText, Index Server offers a wide variety of query options, such as wildcard searches, linguistic capabilities and proximity searches.
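Wildcard and proximity searches are simple to picture in code. The Python sketch below is a generic illustration and does not use the query syntax of Index Server, Search'97 or ConText; the function names and the default gap of five words are assumptions.

```python
import fnmatch
import re

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def wildcard_match(pattern, text):
    """True if any word in `text` matches a pattern such as "comput*"."""
    return any(fnmatch.fnmatchcase(word, pattern.lower()) for word in tokenize(text))

def near(term_a, term_b, text, max_gap=5):
    """True if the two words occur within `max_gap` words of each other."""
    words = tokenize(text)
    positions_a = [i for i, word in enumerate(words) if word == term_a.lower()]
    positions_b = [i for i, word in enumerate(words) if word == term_b.lower()]
    return any(abs(a - b) <= max_gap for a in positions_a for b in positions_b)
```

A wildcard such as "comput*" catches "computer" and "computers" alike, while a proximity test separates a page where two names appear side by side from one where they merely share a document.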
If you've already installed Netscape Enterprise Server or SuiteSpot, Verity's Search'97 is the leading candidate for search-enabling your Web site. Verity's close ties with Netscape make Search'97 a nice fit for Netscape enterprises running on NT or UNIX. However, Microsoft Index Server is simply the better choice for IIS shops. Its tight integration with IIS (as noted, Search'97 doesn't even support IIS 3.0 as of this writing), performance-monitoring hooks and price (free) make it a perfect choice for IIS Webmasters. Where does that leave Oracle's ConText? It's ideal for NT or UNIX shops running Oracle, as Apple Computer, the U.S. Department of Justice, Fidelity and Lucent Technologies can attest.
Regardless of which Web server software you're running, beware: Visitors won't stay glued to your site if it lacks a powerful search engine.
Contributing editor Tom Henderson writes WINDOWS Magazine's NT Administrator column.
SIDEBAR: Selected Search Options
Product: ConText Option 2.0 for Oracle Universal Server
Company: Oracle Corp. 800-ORACLE1, 415-506-7000, Circle #800
Price: ConText Option $495, not including Oracle Universal Server
Platforms: UNIX, Windows NT Server 3.51 and 4.0
Advantages: Integrates content directly into a database; uses server security; handles nontext data; works with any Oracle Universal Server hardware platform; uses any access tool compatible with Oracle Universal Server.
Disadvantages: Requires expensive database purchase for non-Oracle shops; file formats somewhat limited; requires a database administrator
Product: Microsoft Index Server
Company: Microsoft Corp. 800-426-9400, 206-882-8080, Circle #801
Price: Free for NT Server 4.0 customers
Platforms: Windows NT Server 4.0
Advantages: Price; wide file format support; auto-update to indexes; somewhat advanced search features
Disadvantages: No UNIX or Netscape support; remote-server data indexes are manual; requires Registry entry changes
Product: Search'97 Information Server
Company: Verity Software, 800-935-6246, 408-541-1500, Circle #802
Price: From $4,995
Platforms: Windows NT Server 3.51 and 4.0, UNIX, Netscape
Advantages: Tight integration with Netscape's SuiteSpot; extensive file format support; HTML administration
Disadvantages: No support for Microsoft Internet Information Server 3.0