Retrieving web pages via HTTP
Overview
You can use the Winsock gateway to retrieve web pages from any web server. This is the function performed by the “spider” programs employed by the leading search engines. (These programs are also called “crawlers” and “bots.”)
This process uses the Hypertext Transfer Protocol (HTTP; see RFC 1945 for more information) to retrieve a web page from a web server for processing by your Internet Basic program.
Here's an example of how to create a spider. Write a program that opens the Winsock gateway, connects to a domain name, and gets a web page. The contents of that page, including HTTP headers and the raw HTML document, are returned to the Winsock gateway. Your program can read and process this data.
Here are the detailed steps:
1. Open the Winsock gateway.
2. Connect to port 80 (the web server port) of a domain name using the CONNECT control (a Winsock gateway command).
Example
The following line connects to port 80 of the signature.net domain.
Result$ = CONTROL(lun,"CONNECT signature.net 80")
3. Get a web page from the domain name, using the GET command (an HTTP command). The syntax of the GET command is:
GET web-page HTTP/1.0
Example
The following statement gets the web page named test.htm from www.signature.net.
Print (lun) "GET http://www.signature.net/test.htm HTTP/1.0"
4. Print a "CR/LF" to the gateway (which represents a blank line between the header and body of the message).
Print (lun) "@0D0A@"
5. At this point, your program can read records from the Winsock gateway. These records contain the HTTP headers and raw HTML document.
Example
The following data was returned from the test.htm page. Notice the HTTP headers at the beginning, followed by the contents of the web page.
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Mon, 15 Apr 2002 21:44:57 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Mon, 15 Apr 2002 19:53:03 GMT
ETag: "40136827b7e4c11:82e"
Content-Length: 115
<HTML>
<HEAD>
<TITLE> Test Document </TITLE>
</HEAD>
<BODY>
This is a test document.
</BODY>
</HTML>
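For readers who want to try the same exchange outside Internet Basic, the five steps above can be sketched in Python using a plain TCP socket. This is only an illustration under stated assumptions: the function names `build_request` and `fetch` are hypothetical, the socket stands in for the Winsock gateway, and the sketch assumes that on the wire the request is the GET line followed by a single empty line.

```python
import socket


def build_request(url: str) -> bytes:
    """Return the GET line followed by the blank line that ends the request."""
    # Steps 3 and 4: the request line, then an empty line (CR/LF on its own).
    return f"GET {url} HTTP/1.0\r\n\r\n".encode("ascii")


def fetch(host: str, url: str, port: int = 80, timeout: float = 10.0) -> bytes:
    """Connect to the server, send the request, and read the full reply."""
    # Steps 1 and 2: open a TCP connection to the web server port.
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request(url))
        chunks = []
        # Step 5: keep reading until the server closes the connection,
        # which is how an HTTP/1.0 exchange normally ends.
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling `fetch("signature.net", "http://www.signature.net/test.htm")` would mirror the CONNECT and GET examples above; whether that page is still served is not something this sketch can promise.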
We leave it to the Internet Basic programmer to determine how to process the contents of a web page. Some ideas include saving the contents to a Comet data file or searching through the contents for a particular search value.
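As one possible way to approach the processing step, the Python sketch below splits a raw response into its status line, headers, and document, then searches the document for a value. The function names `split_response` and `contains` are hypothetical, and the sample data is an abbreviated copy of the response shown earlier on this page, not a new capture.

```python
def split_response(raw: bytes):
    """Split a raw HTTP response into (status_line, headers, body)."""
    # The first blank line separates the headers from the document.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so values such as dates stay intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


def contains(body: bytes, search_value: str) -> bool:
    """Case-insensitive search of the document for a value."""
    return search_value.lower().encode("latin-1") in body.lower()


# Abbreviated version of the sample response from this page.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Microsoft-IIS/5.0\r\n"
    b"Date: Mon, 15 Apr 2002 21:44:57 GMT\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<HTML><BODY>This is a test document.</BODY></HTML>"
)
status, headers, body = split_response(sample)
```

Header names are lowercased here so that lookups such as `headers["content-type"]` do not depend on how a particular server capitalizes them.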
We suggest that you experiment with this capability by getting web pages from a known source (your own organization's web site, for example) and searching for known data values on each page.
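The experiment suggested above could be organized along the following lines. This Python sketch is an assumption about how one might structure such a check, not part of the Comet tooling: `find_missing` and `fake_fetch` are hypothetical names, and the stand-in fetch routine returns canned text so the example runs without a network connection.

```python
from typing import Callable, Dict, List


def find_missing(pages: Dict[str, List[str]],
                 fetch_text: Callable[[str], str]) -> Dict[str, List[str]]:
    """Return, for each URL, the expected values that were not found."""
    missing = {}
    for url, expected in pages.items():
        text = fetch_text(url)
        absent = [value for value in expected if value not in text]
        if absent:
            missing[url] = absent
    return missing


def fake_fetch(url: str) -> str:
    """Stand-in for a real retrieval routine; always returns the same page."""
    return "<HTML><BODY>This is a test document.</BODY></HTML>"


report = find_missing(
    {"http://www.signature.net/test.htm": ["test document", "price list"]},
    fake_fetch,
)
print(report)
```

Passing the retrieval routine in as a parameter keeps the checking logic separate from the network code, so the same check can be exercised against saved pages first and live pages later.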
See the following web page for a demo program that shows how to get a web page:
http://www.signature.net/download/Demos/