Retrieving web pages via HTTP

 

Overview

 

You can use the Winsock gateway to retrieve web pages from any web server. This is the function performed by the so-called “spider” programs employed by the leading search engines. (These programs are also called “crawlers” and “bots.”)

 

This process uses the Hypertext Transfer Protocol (HTTP; see RFC 1945 for more information) to retrieve a web page from a web server so that your Internet Basic program can process it.

 

Here's an example of how to create a spider. Write a program that opens the Winsock gateway, connects to a domain name, and gets a web page. The contents of that page, including HTTP headers and the raw HTML document, are returned to the Winsock gateway. Your program can read and process this data.

 

Here are the detailed steps:

 

1.         Open the Winsock gateway.

 

2.         Connect to port 80 (the web server port) of a domain name using the CONNECT control (a Winsock gateway command).

 

            Example

 

            The following line connects to port 80 of the signature.net domain.

 

            Result$ = CONTROL(lun,"CONNECT signature.net 80")

 

3.         Get a web page from the domain name, using the GET command (an HTTP command). The syntax of the GET command is:

 

GET web-page HTTP/1.0

 

            Example

 

The following statement gets the web page named test.htm from www.signature.net.

 

            Print (lun) "GET http://www.signature.net/test.htm HTTP/1.0"

 

4.         Print a CR/LF to the gateway. This sends the blank line that marks the end of the request headers; the web server does not respond until it receives this line.

 

            Print (lun) "@0D0A@"

 

5.         At this point, your program can read records from the Winsock gateway. These records contain the HTTP headers and raw HTML document.

 

            Example

 

            The following data was returned from the test.htm page. Notice the HTTP headers at the beginning, followed by the contents of the web page.

 

HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Mon, 15 Apr 2002 21:44:57 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Mon, 15 Apr 2002 19:53:03 GMT
ETag: "40136827b7e4c11:82e"
Content-Length: 115

<HTML>
<HEAD>
<TITLE> Test Document </TITLE>

</HEAD>

<BODY>

This is a test document.

</BODY>
</HTML>
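
To make the wire format in steps 3 through 5 concrete, here is a minimal sketch in Python (not Internet Basic) of what is sent and how the returned data can be split into headers and body. The function names build_get_request and split_response are hypothetical and are not part of the Winsock gateway; the sample response is a shortened copy of the one shown above.

```python
def build_get_request(page: str) -> bytes:
    """Return the bytes for steps 3 and 4: the GET line, then a blank line."""
    request_line = f"GET {page} HTTP/1.0\r\n"
    # The second CR/LF is the blank line that ends the request headers.
    return (request_line + "\r\n").encode("ascii")


def split_response(raw: bytes):
    """Split a raw HTTP response into status line, header dict, and body."""
    # The first blank line (CR/LF CR/LF) separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; header values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Shortened copy of the sample response from the text (assumed CR/LF line ends).
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Microsoft-IIS/5.0\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<HTML>\r\n<BODY>\r\nThis is a test document.\r\n</BODY>\r\n</HTML>\r\n"
)

status, headers, body = split_response(sample)
print(status)
print(headers["content-type"])
```

An Internet Basic program would apply the same idea to the records it reads from the gateway: everything before the first blank record is a header, and everything after it is the HTML document.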

 

How you process the contents of a web page is up to you. For example, your Internet Basic program might save the contents to a Comet data file or search them for a particular value.

 

We suggest that you experiment with this capability by getting web pages from a known source (your own organization's web site, for example) and searching for known data values on each page.
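
As one possible shape for such an experiment, the following Python sketch (again an illustration, not Internet Basic) checks a retrieved page for a list of known values and pulls out the page title. The helper names find_values and page_title are hypothetical, and the page text is the test document shown earlier.

```python
import re


def find_values(page: str, values):
    """Return the values that occur somewhere in the page, ignoring case."""
    lowered = page.lower()
    return [v for v in values if v.lower() in lowered]


def page_title(page: str):
    """Return the text inside <TITLE>...</TITLE>, or None if there is none."""
    match = re.search(r"<title>(.*?)</title>", page, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# The HTML document from the test.htm example.
page = (
    "<HTML>\n<HEAD>\n<TITLE> Test Document </TITLE>\n</HEAD>\n"
    "<BODY>\nThis is a test document.\n</BODY>\n</HTML>\n"
)

print(find_values(page, ["test document", "Comet"]))
print(page_title(page))
```

A plain substring search like this is enough for checking known values; extracting structured data from arbitrary pages generally calls for a real HTML parser.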

 

See the following web page for a demo program that shows how to get a web page:

http://www.signature.net/download/Demos/