By Martin Heller
FIRST, THERE WAS a Web browser on every desktop. Then everyone in the solar system had a home page. After that, every computer in the universe had its own Web server.
Maybe I'm exaggerating, but there's more than a grain of truth here. As more and more new Web server installations crop up, those running them will face the limitations of static HTML, and soon.
If you want to register users, take orders and provide searchable content, you'll have to design forms and write a program or script to handle their output. You may even want to write a program or script to generate the forms.
Designing HTML forms is fairly simple (see the related article in the Enterprise Windows section). The standard, portable way to process information from Web forms hosted by any Web server is via Common Gateway Interface (CGI) programs, and writing them isn't all that bad. If you're using Microsoft's Internet Information Server (IIS), you can use a more efficient, DLL-based interface to Web forms called Internet Server API (ISAPI). If you're using a Netscape server, you can use the similar Netscape Server API (NSAPI).
There are two ways a Web server can send a CGI program or script the output of a form: the POST and GET methods. If the Web form has METHOD="GET" in its FORM tag, the program gets the form's information in the environment variable QUERY_STRING. This variable contains the information that follows the ? in the script's URL.
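Here is a minimal sketch of the GET case in Python. The function name get_query_string is my own, and the check on REQUEST_METHOD (another CGI environment variable, not discussed above) is an assumption about how you'd want to guard the call.

```python
import os


def get_query_string(environ=os.environ):
    """Return the raw form data carried by a GET request, or "" if none.

    The server copies everything after the "?" in the URL into the
    QUERY_STRING environment variable before it starts the program.
    """
    # REQUEST_METHOD is assumed to be set by the server; default to GET.
    method = environ.get("REQUEST_METHOD", "GET").upper()
    if method != "GET":
        return ""
    return environ.get("QUERY_STRING", "")
```

Passing the environment in as a parameter lets you exercise the function with a plain dictionary instead of a live server.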
Sometimes, the whole URL appears in the command line fed to a script. CGI command lines aren't used for queries from forms; they're used only for ISINDEX queries. To differentiate between the two, look for equal signs: Form output has unencoded equal signs; ISINDEX queries don't. For instance, the URL cgitest.exe?name=Martin+Heller&email=email@example.com is a query from a form with fields name and email. The query finger?Heller is an index query that uses the finger command to look up Heller and return the output.
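The equal-sign test is a one-liner. This sketch assumes you've already isolated the part of the URL after the ?; the function name is hypothetical.

```python
def is_form_query(query):
    """Return True if the query string looks like form output.

    Form output has literal "=" signs between names and values. An equal
    sign typed by the user arrives encoded as %3D, so an ISINDEX query
    contains no literal "=" at all.
    """
    return "=" in query
```

With it, is_form_query("name=Martin+Heller") is true, while is_form_query("Heller") is false.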
If the form has METHOD="POST" in its FORM tag, the program gets the form input on the standard input stream (stdin in a C program). Unfortunately, you can't depend on finding an end-of-file mark in the data stream, so your program must use the environment variable CONTENT_LENGTH to determine how much data to read.
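A sketch of the POST case, again in Python. The function name and the decision to treat a missing or malformed CONTENT_LENGTH as "no data" are my assumptions, not part of the CGI specification.

```python
import os
import sys


def read_post_body(environ=os.environ, stream=None):
    """Read exactly CONTENT_LENGTH bytes of form data from standard input."""
    if stream is None:
        # Use the binary view of stdin so the byte count is exact.
        stream = sys.stdin.buffer
    try:
        length = int(environ.get("CONTENT_LENGTH", "0"))
    except ValueError:
        # Assumption: a malformed length is treated as an empty body.
        length = 0
    if length <= 0:
        return b""
    # Ask for no more than the server promised; never wait for end of file.
    return stream.read(length)
```

The key point is that the read is bounded by the length the server reports, so the program never waits for an end of file that may not come.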
Usually, the data in QUERY_STRING or on stdin is URL-encoded, which changes spaces to pluses and encodes some characters into hexadecimal. The data from the named fields are placed into name=value pairs, and the pairs are separated by the & character. Some newer servers let you get binary-encoded instead of URL-encoded data, while some let you get encrypted data. For now, let's stick to URL encoding. To determine the type of data you've received, examine the environment variable CONTENT_TYPE. It will be application/x-www-form-urlencoded for the data I'll discuss.
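Checking the type before you parse might look like the sketch below. The registered spelling of the media type is application/x-www-form-urlencoded; the function name is hypothetical, and the handling of trailing parameters such as a charset is my assumption about what a tolerant check should accept.

```python
def is_urlencoded(content_type):
    """Return True if CONTENT_TYPE names URL-encoded form data."""
    # Drop any parameters after ";" (for example a charset) and compare
    # the media type without regard to case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/x-www-form-urlencoded"
```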
Your CGI program can parse URL-encoded query strings by first splitting the data at the ampersands. That should give you separate strings of the form name=value. For each pair, separate the name from the value at the equal sign. Finally, convert plus signs to spaces and hex-encoded characters (of the form %xx, where the x's are hex digits) to the actual character. You'll now have all the names and values decoded into separate strings.
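The steps above can be sketched in Python as follows. The function names are mine, and treating the decoded bytes as UTF-8 is an assumption; a server of this vintage may well send another character set.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_component(text):
    """Turn "+" into a space and each %xx escape into its character."""
    raw = bytearray()
    i = 0
    while i < len(text):
        pair = text[i + 1:i + 3]
        if text[i] == "+":
            raw.append(ord(" "))
            i += 1
        elif text[i] == "%" and len(pair) == 2 and all(c in HEX_DIGITS for c in pair):
            raw.append(int(pair, 16))
            i += 3
        else:
            # A stray "%" or any ordinary character passes through as-is.
            raw.extend(text[i].encode("utf-8"))
            i += 1
    # Assumption: the decoded bytes are UTF-8 text.
    return raw.decode("utf-8", errors="replace")


def parse_form_data(data):
    """Split URL-encoded form data into a list of (name, value) pairs."""
    pairs = []
    for item in data.split("&"):          # step 1: split at the ampersands
        if not item:
            continue
        name, _, value = item.partition("=")  # step 2: split at the first "="
        pairs.append((decode_component(name), decode_component(value)))  # step 3
    return pairs
```

Note that the decoding happens after the splits, so an encoded & or = inside a value can't be mistaken for a separator. In real Python code you'd probably reach for urllib.parse.parse_qsl in the standard library, which does the same job.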
Returning information from a CGI script to the Web server and from there to the client browser is simple: Just send it to the standard output stream. The only trick is telling the server what type of data you're sending. Do this with an ASCII header. Any headers that aren't server directives are sent directly back to the client, so ordinary batch files, shell scripts and console programs will work, and the output will be displayed as readable, albeit ugly, text.
The three server directives defined in CGI 1.1 are Content-type, which is the Multipurpose Internet Mail Extensions (MIME) type of the document you're returning; Location, which tells the server you're returning a reference to a document rather than an actual document; and Status, which returns an error number and reason (my personal favorite is 404 Not Found). You'd use Content-type to say you were returning HTML or some other kind of document.
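Here's a sketch of how a script might build its output using these directives. The function names and defaults are hypothetical. One detail worth stating: a blank line marks the end of the headers, and the document body follows it.

```python
def cgi_response(body, content_type="text/html", status=None):
    """Build CGI output: header lines, a blank line, then the body."""
    lines = []
    if status is not None:
        lines.append("Status: " + status)
    lines.append("Content-type: " + content_type)
    # The empty line after the headers tells the server the body starts.
    return "\n".join(lines) + "\n\n" + body


def cgi_redirect(location):
    """Build CGI output that points the server at another document."""
    return "Location: " + location + "\n\n"
```

A script would write the result to standard output, for example with sys.stdout.write(cgi_response("<H1>Thanks!</H1>")).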
With such simple requirements (the ability to get and parse environment strings, read standard input and write standard output, and run as a native executable or as a script associated with a native executable), there are no serious limits on the programming languages you can use for CGI scripts. You'll see them written as Perl scripts, C programs, Tool Command Language (TCL) scripts and UNIX Bourne shell scripts. They can be written in almost any language the Web server's operating system supports, and you'll find CGI programs written in Visual Basic, Access and various platform-specific languages and development packages.
A Web search for CGI will unearth a wealth of resources for downloading. Among them are several Perl and C language libraries, a Bourne shell utility for parsing a URL-encoded data string into environment variables and a set of TCL routines for parsing form data into TCL variables. A good CGI tutorial is available at http://hoohoo.ncsa.uiuc.edu/cgi/.
No matter what language you use for CGI, you must still face security issues. The difference between writing desktop applications and writing Internet server applications is that you can usually assume desktop application users are friendly. They may cause some unintentional harm, but you can generally anticipate such problems.
Public server applications are another kettle of fish. Sadly, you have to assume some Web users are hostile. They may want to crack into the corporate network behind the Web site to steal information, or they might want to crash the Web site out of sheer mischief. So you have to write and test CGI applications with an eye toward possible malice.
The worst hole to watch out for in CGI scripts deals with the eval statement and special escape characters. It's mostly an issue in shell languages, including Perl. The eval statement lets you construct a string and pass it to the shell to interpret. If your string depends on user input, and the user includes a shell escape character, the effect of executing the resulting string can be disastrous. The moral is: Don't run user input as code, at least not without serious scrutiny.
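One common defense, sketched here in Python, is to check user input against a strict pattern and hand the command to the operating system as a list of arguments, so no shell ever interprets it. The finger lookup, the function names and the particular pattern are illustrative assumptions; choose rules that fit your own data.

```python
import re
import subprocess

# Assumption: a legitimate user name is short and uses only these characters.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]{1,32}")


def build_finger_command(user):
    """Return an argument list for finger, or raise ValueError on bad input."""
    # Reject anything outside the pattern, and anything that could be
    # mistaken for a command-line option.
    if not SAFE_NAME.fullmatch(user) or user.startswith("-"):
        raise ValueError("rejected user name: %r" % user)
    return ["finger", user]


def run_finger(user):
    """Run finger without a shell and return its output as text."""
    argv = build_finger_command(user)
    # A list (not a string) means no shell parses the arguments.
    result = subprocess.run(argv, capture_output=True, text=True, timeout=10)
    return result.stdout
```

With this in place, input such as "Heller; rm -rf /" is refused before any command runs.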
You'll find many other Web security gotchas, among them server-side includes (don't enable them in script directories). Fortunately, there are those who actively track these gotchas and bring them to the attention of the Web administration community. You'll find answers to the frequently asked questions about Web security (at least those pertaining to UNIX- and Linux-based servers) at http://www-genome.wi.mit.edu/WWW/faqs/www-security-faq.html.
Aside from the security risks, CGI offers inherently poor scalability and efficiency. Each instance of a CGI script runs in its own process and address space, not in the Web-server process space. On 32-bit Windows systems, that means every invocation of a CGI script causes WinExec to run, a new copy of an executable and possibly a new copy of a script to be loaded from disk, and a new address space to be created with a new process and a new thread. Often, the script proper does very little, and the process creation overhead becomes a major portion of the overall CGI script time.
One way to improve the situation is to make a Web server extension that runs as a DLL and uses a separate thread of execution to handle each call from the server. This amounts to creating a thread for each client request in the Web server's address space, a far superior alternative to creating a process for each client request.
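To make the thread-per-request model concrete, here's a toy sketch in Python. It isn't an ISAPI or NSAPI extension, and the names are hypothetical; it only shows each request being handled on its own thread inside one process, with no new process or address space created.

```python
import threading


def serve_in_threads(handler, requests):
    """Run handler(request) on a separate thread for each request."""
    results = [None] * len(requests)

    def worker(index, request):
        # Every thread shares the caller's address space, so it can store
        # its answer directly in the shared results list.
        results[index] = handler(request)

    threads = [
        threading.Thread(target=worker, args=(i, request))
        for i, request in enumerate(requests)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The flip side of sharing an address space is that a misbehaving handler can take the whole server down with it, so extension code needs to be written with more care than a stand-alone CGI program.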
Two of the leading Web server vendors, Netscape and Microsoft, have published their own proprietary DLL-based Web-server extension schemes. Netscape's NSAPI also works on UNIX systems that support shared objects. Microsoft's ISAPI doesn't have to work on UNIX, because Microsoft's IIS runs only on Windows NT Server.
Briefly, ISAPI DLLs have two required entry points, GetExtensionVersion and HttpExtensionProc. The former call lets the server know the extension DLL's version numbers and description string on initialization, and the latter call is the equivalent of the extension's main routine. Information is passed to HttpExtensionProc through its single parameter, an extension control block pointer. The extension control block structure carries the major information that would be passed in environment variables to a CGI program.
The ISAPI program can request additional information by name through the GetServerVariable call, read information from the body of the Web client's HTTP request with a ReadClient call, send information to the HTTP client with a WriteClient call, and return locations, redirection and status information to the server with a ServerSupportFunction call. You'll find additional information on ISAPI in the Microsoft IIS SDK, downloadable from http://www.microsoft.com/intdev/.
NSAPI isn't as easy to summarize. Although similar to ISAPI, its form is more complicated, more flexible and more closely tied to the server setup. Each NSAPI function has to be configured in the Netsite object configuration database. NSAPI parameter blocks are based on name-value pairs, much like Web form variables. You'll find information on NSAPI at http://www.netscape.com/newsref/std/server_api.html.
When every computer in the universe finally does have its own Web server, there will be an interesting consumer market for server extensions. Anyone interested in starting a pool to predict when that will be?
Martin Heller surfs the Net and hacks code from Andover, Mass. Contact Martin through his Web page at http://www.winmag.com/people/mheller or via e-mail at firstname.lastname@example.org.