Basic HTML

What and Where


Our biolinx computer has a web server on it. Apache is the brand
name: it is Open Source software, and it is probably the most
common web server in existence.
From a practical point of view, the web server makes all files located
in the /srv/www/htdocs/biolinx/html directory (and any subdirectories
under it) visible to the World Wide Web. Pointing your web browser at
http://biolinx.bios.niu.edu gives you access to this directory.
For example, look at the hello.html file from within biolinx
(/srv/www/htdocs/biolinx/html/hello.html) and from your web browser
( http://biolinx.bios.niu.edu/hello.html ). They are the same file! Try
comparing the source code using View Source in your web
browser. However, we can manipulate the file from inside biolinx;
from the Web all we can do is look at it.
You each have your own subdirectory for HTML:
/srv/www/htdocs/biolinx/html/z012345 (or whatever your z-number
is), viewed through the web as http://biolinx.bios.niu.edu/z012345 .
Put all your HTML documents in this directory.

What is HTML
Hyper Text Markup Language is a markup language. It
is a set of instructions to your web browser to cause the
text to be displayed in a certain way.
HTML is not a programming language in that it doesn't
allow decisions (if statements) or loops.
You can see what the actual HTML document looks like
(as opposed to how it is displayed) using the View
Source control on the browser.
HTML grew out of SGML, Standard Generalized
Markup Language, which is a generic way of
representing any document. SGML is more or less too
complicated to be useful, but it has spawned two
important descendants, HTML and XML (which we will
discuss later).

HTML Standards

HTML is an evolving language. I am presenting approximately HTML
version 3.2, which is quite simple but which should work with all current
browsers. We want to be able to generate HTML documents on the fly,
from programs written in Perl, to display data dynamically. This is best done
using simple HTML rather than the more complex forms used by large
commercial web pages.
HTML 4.0, a more recent version, has deprecated many of the tags that
determine style (notably the <font> tag), and asks that you put style
information in Cascading Style Sheets. Despite the deprecation, billions of
web documents were (and continue to be) written without style sheets. For
this reason, all browsers continue to support older versions of HTML, and will
do so for the indefinite future. However, HTML 4.01, which was released in
late 1999, is the current standard for the web.
Deprecated means that there is a newer and better way of marking up the
information than the old tag. However, deprecated tags still work.
Obsolete tags may not work.
XHTML (Extensible HTML) is still being developed. It is an attempt to
convert HTML into XML. Version 1.0 has been released.

Document Type Definition


HTML standards are defined in documents called DTDs (document type
definitions). There is a default DTD used by the browser, and thus we don't
have to explicitly define a DTD. XML documents, by contrast, often come with
a separate DTD file.
If desired, we can explicitly use a DTD by starting the HTML file with the
line:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
This line says that the document follows the guidelines of the World Wide
Web Consortium (W3C) transitional standards for HTML 4.01. Transitional
means that some HTML 3.2 is still involved. W3C is the body that sets
standards for the web.
However, you should be aware that approximately 90% of the browsers
these days are Microsoft's Internet Explorer. This semi-monopoly allows
Microsoft to ignore standards or create its own at will.
In practical terms, a web site that displays correctly for both Internet
Explorer and Mozilla Firefox will probably cover just about all situations: IE
because of the above-stated Microsoft 900-pound-gorilla problem, and
Firefox because it follows the W3C standards that all other browsers use.

HTML Tags
The basic feature in HTML documents is the tag.
Tags are set off by angle brackets (< and >), with the
tag name between them. For example, the entire HTML
document is placed between the opening tag <html> and
the closing tag </html>.
Most tags occur in pairs, indicating what is supposed to
happen to whatever text is between them. The closing
tag has the same name as the opening tag, but the
closing tag starts with a slash (/). For example, <b>make
this bold</b>. The text between the <b> and </b> tags
is made boldface by the browser.
Pairs of tags are supposed to be nested: you close all
inner tags before closing outer tags. Thus,
<b><i>bold and italicize</i></b> CORRECT
<b><i>bold and italicize</b></i> WRONG

More on Tags
Opening tags often contain attributes as well as tag
names. Attributes are separated from each other by
spaces, and they are in the form of: name=value. For
example: <h2 align=center>Title</h2> creates a
centered headline. The default is left-justified.
HTML tags are case-insensitive: <table>, <TABLE>, and
<tAbLE> are all equivalent. However, the current
XHTML standard suggests that we should use small
letters: <table>.
Some tags don't have a closing tag. <br>, a line break, is
a common example. The XHTML standard suggests
putting a slash into the single tag in these cases: <br />.

Character Entities
The other commonly seen feature in HTML documents is the
character entity, a group of characters starting with & (ampersand)
and ending with ; (semicolon). The entity represents a single
character in the browser display.
For example &gt; represents the > greater than sign. Since > is part
of each tag, browsers have a hard time displaying the actual >
character. By having &gt; in the HTML document, the browser will
display the character you want and not try to interpret it as part of a
tag. The same applies to &lt; for < and &amp; for the ampersand itself.
Very useful is &nbsp; , a non-breaking space, which is how you get
multiple spaces. If you just use the space bar, HTML browsers will
compress all those spaces into just 1 space. So, to get multiple
spaces, use several &nbsp; entities in a row.
All entities have a number: &#62; is the same as &gt; . Not all
have a mnemonic name.
All characters have numeric entities, but most are rarely used. Thus,
&#97; represents the letter a. There is no mnemonic name for this
letter; mostly we just type in the letter itself.
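As a small illustration of the entities described above, the fragment below uses named and numeric entities side by side. It is only a sketch, and the sentences in it are made up for the example.

```html
<p>
  <!-- &lt; and &gt; display as literal angle brackets instead of starting a tag -->
  To make text bold, wrap it in &lt;b&gt; and &lt;/b&gt;.<br>
  <!-- several &nbsp; in a row keep their width; plain spaces would collapse to one -->
  Name:&nbsp;&nbsp;&nbsp;&nbsp;value<br>
  <!-- &#62; is the numeric form of &gt;, and &#97; is the letter a -->
  5 &#62; 3, and &#97; is the first letter of the alphabet.
</p>
```

Viewing this in a browser and then using View Source shows the difference between what is typed and what is displayed.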

HTML Document Structure


HTML documents are supposed to have the form of a
tree, or equivalently, in the form of a set of nested tags.
The document should open with <html> and close with
</html>
Within the <html> tags are 2 sections: <head> ...
</head> and <body> ... </body>.
In the head section is a <title> ... </title> line. The title is
displayed at the very top of the browser window.
The body section contains all the tags and text that are
displayed in the main window.
See the Basic HTML Commands web page
(http://www.bios.niu.edu/johns/bioinform/htmlcom.html )
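The structure described above can be sketched as a complete, minimal document. The title and body text here are placeholders chosen for the example, not part of any course file.

```html
<html>
<head>
  <!-- the title is shown at the top of the browser window, not in the page itself -->
  <title>My First Page</title>
</head>
<body>
  <!-- everything displayed in the main window goes here -->
  <h1>Hello</h1>
  <p>This text is <b><i>bold and italic</i></b>; the inner tag closes first.</p>
</body>
</html>
```

Drawn as a tree, html is the root, head and body are its two children, and every other tag hangs below one of them.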

A Few Tags
Headlines are within tags like <h1> ... </h1>. H1 is the
largest, H6 is the smallest. The align attribute can be
used to move the headline: <h1 align=center> or <h1
align=right>. The default is left alignment.
Text is set off in paragraphs within <p> ...</p> tags.
Note that the closing tag is often left off. However, that is
a sloppy practice that I discourage.
The <br> or <br /> tags introduce line breaks: less space
between lines than with <p>. There is no ending tag for
<br>; it is considered part of the previous <p>
paragraph.

Lists and Tables


<ul> starts an unordered (bulleted) list;
<ol> starts an ordered (numbered) list.
Items within the list are set off with <li> ...
</li> (list item) tags.
<table> starts a table. <table border> puts
a border around it. Tables are built row by
row, and cell by cell within each row.
Table rows are <tr> ... </tr>. Cells within
rows are <td> ... </td>.
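Put together, the list and table tags above might be used as in the following sketch. The item and cell text is invented for illustration.

```html
<!-- unordered (bulleted) list -->
<ul>
  <li>first bulleted item</li>
  <li>second bulleted item</li>
</ul>

<!-- ordered (numbered) list -->
<ol>
  <li>step one</li>
  <li>step two</li>
</ol>

<!-- a bordered table with 2 rows of 2 cells each -->
<table border>
  <tr>
    <td>row 1, cell 1</td>
    <td>row 1, cell 2</td>
  </tr>
  <tr>
    <td>row 2, cell 1</td>
    <td>row 2, cell 2</td>
  </tr>
</table>
```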

Images
Images are placed with <img> tags, with no
closing tag. The basic syntax is:
<img src="source_file" alt="alternate text" title="tool tip text">
The src= value is a local file, the path to a file in
a different directory under the HTML root
directory, or a URL.
The tool tip text is displayed when the mouse
hovers over the image. The alt text is displayed if for
some reason the image won't display; it is also very
useful for the visually impaired.

Links
To put in a hyperlink, the anchor <a> ... </a> tag
is used. Syntax:
<a href=URL>text to use as link</a>
You can also use an image between <a> and
</a>. In this case, clicking on the image sends
you to the linked URL.
If the linked page is on the same server, you can
just use the file name, or the path to the file
name, as the URL. However, if the linked page
is on a different server, you should use the entire
address, including the http://, as the URL.
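The three kinds of link described above could look like the sketch below. The file names page2.html and logo.png are made up for the example, and the alt attribute (the standard place for replacement text) is my addition.

```html
<!-- link to a page in the same directory on the same server: file name only -->
<a href="page2.html">Next page</a>

<!-- link to a different server: the full address, including http:// -->
<a href="http://www.w3.org/">World Wide Web Consortium</a>

<!-- an image used as the link; clicking the image follows the href -->
<a href="http://biolinx.bios.niu.edu/hello.html"><img src="logo.png" alt="Site logo" title="Go to the hello page"></a>
```

Attribute values are quoted here because some of them contain spaces or slashes; quoting every value is the safest habit.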

Comments
Anything within <!-- your comment --> is a
comment: it is not displayed in the browser
even though it appears in the source
code.
Comments can be many lines long.
Note that there is no real closing tag: the
entire tag is enclosed within the opening
<!-- --> tag.

Forms

The form tag <form> ... </form> is used to send user-specified information
back to the server. The server then sends back its response, a new HTML
document.
The form tag itself needs at least 2 attributes, the action attribute and the
method attribute.
Although there are other methods, we generally use method=post for our
interactive programs.
The action of a form is the program on the server that the form's contents
are sent to. That program processes the information and returns the
response document.
Only programs in the cgi-bin directory can be processed under our system.
Thus, a typical form tag will look something like:
<form action="/cgi-bin/bios546/hello.cgi" method="post"> ...form contents... </form>
Note that since the program that responds to this form is on the same
server, the action's URL doesn't need to contain http://biolinx.bios.niu.edu.
However, it does need to start with /cgi-bin.
The form sends name=value pairs to the server. name and value are
both specified within each form element.

Basic Form Elements

All forms need a Submit button: clicking this button
sends the form to the server. Syntax:
<input type=submit value="button label">. If you don't
specify a value, the browser supplies a default label such as Submit.

Radio buttons: You typically use them in groups, all of which
have the same name but different values. Only one button
can be checked; the parameter is given the value
associated with the checked button. It is possible to have
one button checked as a default, by putting the word
"checked" after the value="par_value" attribute.
<input type=radio name="parameter" value="par_value">
The value attribute of the checked radio button is sent to the
server as the value of the parameter.

More Form Elements


Check boxes: If checked, the value given by the value attribute is
sent to the server (browsers send "on" when no value is specified).
If not checked, neither name nor value is
sent to the server. If you want it checked by default,
include the word checked within the tag.
<input type=checkbox name="parameter">
Text boxes: if you want to enter a single line of text.
Whatever is typed into the box gets sent as a string to
the program given by the form action mentioned above,
as the value of a parameter whose name is given by
"name=". You can change the size of the text box with
the attribute size; its value is the number of characters
that can be displayed:
<input type=text name="parameter" size=25>
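The elements described so far can be combined as in the sketch below. The action reuses the hello.cgi path from the text; the parameter names size, newsletter, and email are hypothetical and chosen only for this example.

```html
<form action="/cgi-bin/bios546/hello.cgi" method="post">
  <!-- radio buttons: same name, different values; "medium" is checked by default -->
  <p>Size:
    <input type="radio" name="size" value="small"> Small
    <input type="radio" name="size" value="medium" checked> Medium
    <input type="radio" name="size" value="large"> Large
  </p>
  <!-- check box: nothing is sent for "newsletter" unless the box is checked -->
  <p><input type="checkbox" name="newsletter" value="yes"> Send me the newsletter</p>
  <!-- text box wide enough to show 25 characters -->
  <p>E-mail: <input type="text" name="email" size="25"></p>
  <p><input type="submit" value="Send"></p>
</form>
```

If the user leaves the defaults and types a@b.edu, the server would receive size=medium and email=a@b.edu, and no newsletter parameter at all.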

Select Boxes
Select boxes: a drop down list of options. It has a
different syntax than most of the other input tags: <select
name=parameter> ... </select>.
Each option in the select box is specified by the
<option> ... </option> tag. When the form is submitted,
the text between the opening and closing tags is sent as
the value of the parameter specified in the <select
name=parameter> tag.
By default only 1 option is displayed. You can use the
size=number attribute in the <select> tag to display as
many options as you want.
To allow the user to select multiple options, use the
keyword multiple in the <select> tag: <select multiple
name=whatever>
A default value is created by adding the keyword
selected to the option tag: <option selected>this one!</option>
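A select box using the size, multiple, and selected keywords together might be sketched as follows. The parameter name organism and the option text are invented for illustration.

```html
<!-- a list showing 3 options at once; the user may pick more than one -->
<select name="organism" size="3" multiple>
  <option>Human</option>
  <!-- "Mouse" is highlighted when the page first loads -->
  <option selected>Mouse</option>
  <option>Yeast</option>
</select>
```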

A Basic Form
<html>
<head>
<title>Basic Form</title>
</head>
<body>
<h1>Basic Form</h1>
<form action="/cgi-bin/bios546/hello.cgi" method="post">
<p>What is your name? <input type="text" name="your_name">
<br>Please select your favorite color:
<select name="color">
<option>Red</option>
<option>Blue</option>
</select>
<br><input type="submit" value="Click Me!"></p>
</form>
</body>
</html>

Processing Forms
Once a form is submitted, it is sent to a specific program on the
server.
This procedure uses the Common Gateway Interface, or CGI. The
programs run under the CGI are called CGI scripts. We will be
writing ours in Perl, but other languages are also used.
In our configuration, programs that process forms must be located
under the CGI root directory: /srv/www/htdocs/biolinx/cgi-bin. You
have a personal directory under this.
For example, the hello.cgi program is located at
/srv/www/htdocs/biolinx/cgi-bin/bios546/hello.cgi
As with HTML addresses, this program has an alias used as the
action attribute of the form tag:
<form action="http://biolinx.bios.niu.edu/cgi-bin/bios546/hello.cgi" method="post">

CGI Basics
CGI programs are simply Perl programs with a
few minor modifications that alter input and
output.
A key point: you need to change permission on
your CGI programs so that anyone can execute
them. When going through the Web, you are the
anonymous user nobody.
Any program in your CGI directory can be run
through the CGI interface (i.e. invoked through a
form on an HTML page). I often use the .cgi
extension on my programs just to remind me
that they are meant to be used on the Web.

Input to CGI Programs


To get input, we use the CGI module. Near the
top of the program, put in the line use CGI; just as
you would put in use strict;
The CGI module is a complex thing that allows
you to do many interesting things, but I prefer to
use only the simplest functions in it.
The CGI module uses object-oriented syntax.
Nothing mysterious about this, it is simply an
alternate way of writing things down.

Input Parameters
To get parameters from the form into a CGI program,
you first need to create a new CGI object with the
command:
my $cgi_obj = new CGI;
Then, each parameter on the form needs to be captured
into a Perl variable.
my $var1 = $cgi_obj->param('parameter1');
my $var2 = $cgi_obj->param('parameter2');
The parameter names are the values of the name
attributes in the various form elements.
You then process the input parameters as you would any
other Perl variables.
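The steps above can be sketched as a small program. It assumes the your_name and color fields of the basic form shown earlier; the read_form helper and its fallback values are my own additions for illustration, not part of the CGI module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

# Copy the two form parameters into a hash, substituting a fallback
# for anything the user left blank. read_form is a hypothetical helper.
sub read_form {
    my ($cgi_obj) = @_;
    my $name  = $cgi_obj->param('your_name');
    my $color = $cgi_obj->param('color');
    $name  = 'stranger' unless defined $name  && length $name;
    $color = 'unknown'  unless defined $color && length $color;
    return (your_name => $name, color => $color);
}

my $cgi_obj = CGI->new;    # same effect as: new CGI
my %form    = read_form($cgi_obj);
```

After this, $form{your_name} and $form{color} are ordinary Perl strings. Note that the parameter names are quoted; under use strict, a bare word inside param() is a compile error.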

CGI Output
All print statements in programs in the cgi-bin directory
have their standard output re-directed to the web server.
That is, you send information back to the submitter of the
form by simply printing it.
One small qualification: in order for your browser to
understand that this is HTML, you need to print the line
Content-type: text/html\n\n at the beginning of the
printing. Note the \n\n: there MUST be a blank line
between the Content-type line and the <html> tag that
starts the actual document.
Otherwise all printing is exactly as we have described for
other Perl programs.
Note that you must print an HTML document to get a
good display!
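A minimal sketch of the output rules above follows. The response helper and the message text are hypothetical; the point is only the order: header line, blank line, then a complete HTML document.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the complete response: header line, blank line, then the HTML.
# response is a made-up helper name for this example.
sub response {
    my ($message) = @_;
    return "Content-type: text/html\n\n"
         . "<html>\n<head><title>Reply</title></head>\n"
         . "<body>\n<p>$message</p>\n</body>\n</html>\n";
}

# In a CGI program, printing is all it takes to send the page back.
print response('Hello from the server!');
```

Leaving out the second \n is a common cause of Internal Server Error messages, because the server cannot tell where the header ends.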

Multi-line Printing
Sometimes called a here document (or here statement),
because you print down to here.
The statement print <<WZRT; causes
every line from that point to where WZRT
appears on a line by itself to be printed,
with no need for \n or any other format
commands.
Variables are interpreted as usual.
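As a sketch of the multi-line syntax above, the program below builds a whole page with one <<WZRT block. Here the block is returned from a hypothetical page sub so it can be reused; writing print <<WZRT; directly works the same way. The user name is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a complete CGI response. Everything up to the line containing
# only WZRT is taken as-is, with $user replaced by its value.
sub page {
    my ($user) = @_;
    return <<WZRT;
Content-type: text/html

<html>
<head><title>Greeting</title></head>
<body>
<p>Hello, $user!</p>
</body>
</html>
WZRT
}

print page('Ada');
```

The closing WZRT must start in the first column and be alone on its line, and the empty line after Content-type supplies the required blank line.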

File Permissions

When you access a CGI program through a web browser, you are an
anonymous user with minimal permissions to do anything. Even though
you think you are you, the owner of the program, the web browser causes
you to become anonymous.
Thus, you must grant execute permission on your CGI file to everyone:
chmod 755 program.cgi.
More complex is the problem of using a CGI program to write to another
file. Three things need to be done:
1. Create the file you wish to write: touch
/srv/www/htdocs/biolinx/html/z123456/prog_results.htm. The touch command
creates the file without putting anything into it.
2. Change the permissions on that file so anyone can write to it: chmod 666
prog_results.htm.
3. Be sure to use the full path to that file. Typically, the CGI file is in
/srv/www/htdocs/biolinx/cgi-bin/z123456 and you are writing an image file at
/srv/www/htdocs/biolinx/html/z123456. So, in the printed output from your CGI
program, access the image file with a tag like <img
src=http://biolinx.bios.niu.edu/z123456/my_image.png>.
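On the Perl side, writing to such a file might be sketched as below. The append_result helper is hypothetical; it assumes the file was already created and made writable as described in the steps above, and it reports failure instead of dying so the CGI program can still return a page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Append one line to a results file, given its full path.
# Returns 1 on success, 0 if the file could not be opened or closed
# (usually a permissions problem when run through the web server).
sub append_result {
    my ($path, $line) = @_;
    open(my $fh, '>>', $path) or return 0;
    print {$fh} "$line\n";
    close($fh) or return 0;
    return 1;
}

# In a CGI program the call might look like this (path from the text above):
# append_result('/srv/www/htdocs/biolinx/html/z123456/prog_results.htm',
#               '<p>run finished</p>')
#     or print "<p>Could not write results file: $!</p>\n";
```

Using the full path matters because the web server does not run your program from your own directory.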

Useful Debugging Tools


The CGI::Carp module sends error messages to your
browser. If you don't use it, you get cryptic Internal
Server Error messages with no debugging information.
Syntax:
use CGI::Carp qw(fatalsToBrowser);
On the biolinx command line, perl -c your_program.cgi
checks the program's syntax. It will either return syntax
OK or an error message. This allows checking the
syntax without having to run the program.
Remember that running a program through the web
means that you are the anonymous user nobody, who
has very few privileges. Be sure to check permissions,
especially if your program writes to any files.

Recap of CGI Processing of Forms


Start with an HTML file in your HTML directory:
/srv/www/htdocs/biolinx/html/z012345/prog1.htm.
This HTML file can be accessed through the web using a
web browser, at the URL:
http://biolinx.bios.niu.edu/z012345/prog1.htm
The HTML file contains a form, whose action sends
parameter name=value pairs to a CGI program on the
server:
<form action=/cgi-bin/z012345/prog1.cgi method=post>
The CGI program prog1.cgi is a Perl script located in
your CGI directory:
/srv/www/htdocs/biolinx/cgi-bin/z012345/prog1.cgi

Recap of CGI Processing of Forms, pt. 2
Your CGI program contains the lines
use CGI;
use CGI::Carp qw(fatalsToBrowser);
at the top, just below the #!/usr/bin/perl -w line.
You must first create a new CGI object:
my $cgi_obj = new CGI;
Parameter values from the form are put into Perl
variables using object-oriented syntax:
my $var1 = $cgi_obj->param('parameter1');
The Perl variables are then manipulated by the program
as you see fit.

Recap of CGI Processing of Forms, pt. 3
Output is printed just as in any other Perl
program, except that it is re-directed to the
web browser that requested it by
submitting the form.
Output needs to have the line
Content-type: text/html\n\n
at the beginning of the output.
