HTML Philosophy and Tutorial

Draft - still under construction

The purpose of this and the accompanying articles is not to teach the details of any specific part of the language. Such information is readily available from many sources.

You are reading this in order to learn how to think while using the language, how to write poetry and not doggerel, how to be a good programmer, coder, or author.

Web Design Philosophy

If you haven't read the introduction to web design philosophy, you should do that first. It covers the general philosophy to be applied to all languages, and you should understand it before reading this document.

HTML Philosophy

Use HTML to define meaning and intent.

The target audience is software. Use HTML only to mark up the natural data so that it can be understood by computer programs. Do nothing to control its eventual appearance or presentation.

Terminology

You are likely already familiar with these terms, but this will serve as a reminder.

data
is the original text of the document being marked up.
metadata
is something that describes the data.
tags
are the <name>text</name> metadata markers that are added to a document to declare the semantics of the enclosed text.
attributes
optionally appear within the opening tag: <tag name1="value" name2="value">. They consist of an attribute name, an equal sign, and a quoted value. Those values are the metadata.
self-closing tags
are tags that are purely metadata and never contain any text data. Also known as void elements, they are generally used to provide data that can't otherwise be represented as text (e.g. an image stored in an external file). Rather than having a closing tag they are written as a single marker with a slash immediately before the final >. E.g. <img src="..." />, rather than <img src="..."></img>.
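
As a small illustration of these terms, the fragment below marks up one sentence and one image. The class name, file name, and wording are invented for the example and are not taken from any real page.

```html
<!-- "p" and "em" are tags; the sentence they enclose is the data. -->
<p class="greeting">Hello, <em>world</em>.</p>

<!-- "src" and "alt" are attribute names; their quoted values are metadata. -->
<!-- img never contains text, so it is written as a single self-closing marker. -->
<img src="lighthouse.jpg" alt="Photograph of a lighthouse at dusk" />
```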

What Markup Language Does

 <entry>
    <word>...</word>
    <pronunciation>...</pronunciation>
    <derivation>...</derivation>
    <definition>
       ...
       <quotation>
          ...
          <date>...</date>
          <author>...</author>
       </quotation>
       <quotation>
          ...
          <date>...</date>
          <author>...</author>
       </quotation>
    </definition>
    <definition>
       ...
       <quotation>
          ...
          <date>...</date>
          <author>...</author>
       </quotation>
    </definition>
 </entry>
			
Contrived dictionary markup.

A markup language is a language that enables you to mark up existing documents in a way that makes them more understandable to computer programs.

By the 1980s, the gigantic Oxford English Dictionary had become far too large to maintain by hand, but converting it into a computer database seemed like an almost impossible task. Simply converting the printed copies (nearly 23,000 pages) into computer files wasn't enough to be useful; the software needed to be able to separate individual entries, and to know which parts were the word, which the definition, etc. The answer was to use a form of GML to mark up the text, labeling the semantics so that it could be understood by the software. The result of that massive project was something like what is shown here, with the ... representing the original text.

Now any program could access this data, select the relevant entries, and display the results in whatever format was suitable. For instance, by selecting only those entries and definitions that contained quotations dated before 1700, one could produce a printed dictionary that would represent the English language as it appeared 300 years ago.

What Markup Language Doesn't Do

The original printed OED used different fonts and sizes to make the various parts of each entry more obvious to the reader. What was intelligent about the OED conversion process was that it did not attempt to preserve these fonts; instead it preserved the semantics. There is nothing in the markup to indicate what anything should look like, or even what order any of the parts should appear in.

And that is the significant concept you should learn from this. Markup language is intended strictly for communicating with a target audience that is nothing but computer software. That this software will eventually format the data into something that will appear on paper or on computer screens is irrelevant. If we attempt to communicate any information about formatting, we are defeating the purpose of the language and writing for the wrong audience.

That same concept is what you must follow when using other markup languages, in our case HTML.

HTML

The original version of HTML behaved exactly as it should, allowing one to mark paragraphs, lists of items, headings, etc. But a large number of problems, and corresponding bad solutions, soon appeared.

Different browsers formatted things differently.

Most differences were subtle, and so didn't matter, but some were significant. For instance, <h1> headings appeared too large in some browsers, and too small in others.

People wanted more control.

Programmers had spent the previous 20 years using languages such as troff, where everything was written in terms of format. Troff had given programmers complete control over every aspect of the formatting, but HTML had moved that ability into the web browser.

People still thought in terms of format.

People were used to putting .ti 5 at the beginning of each paragraph, to indicate that they wanted to indent the next line by 5 characters. Some defined their own .P macros to start a paragraph without producing widows or orphans.

Having to say where that paragraph ended didn't make sense, and saying that a block of text is a paragraph was a foreign concept. They knew how they wanted this specific paragraph to start, but couldn't see the larger picture.

People were lazy.

People were willing to put a <p> at the beginning of each paragraph, since that was similar to using the .P macro that they often used, but putting a closing </p> tag at the end seemed like a waste of time, so most didn't.

Some tags, such as <img>, contain only metadata and no real data, so entering a closing tag to supply empty data seemed like a waste of time.

Browsers catered to people's mistakes.

Instead of rejecting bad code, browsers simply accepted whatever was given to them and did the best that they could with it. It's illegal to nest a <p> inside another paragraph, so whenever such a situation occurred, browsers supplied the missing closing paragraph tag, as well as missing closings for other open tags within the paragraph.

Similarly, all kinds of nonsense ended up being accepted by browsers and turned into something that looked fine. People adjusted and corrected problems with their web pages by playing with the HTML until the formatted result looked okay. It didn't matter that much of their code was now the equivalent of illiterate gibberish; it worked, and that was what mattered.
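
To make the paragraph case concrete, here is a sketch of what an author might have typed and how a forgiving browser treats it. The text is placeholder content.

```html
<!-- As typed: opening tags only -->
<p>First paragraph
<p>Second paragraph

<!-- As the browser interprets it: the missing closing tags are supplied -->
<p>First paragraph</p>
<p>Second paragraph</p>
```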

Parsing required semantics.

Markup in which every element is explicitly opened and closed, as in fully tagged SGML, can easily be parsed without the software knowing what any of the tags mean. But to parse HTML one now had to know the semantics of the tags, such as which could appear inside others, which didn't need closing tags, etc.

This not only made the software more complicated, it made it much more difficult for beginning programmers to understand existing code. They couldn't simply ignore the tags that they hadn't learned yet, as one needed to know their semantics in order to parse the rest of the code. I myself gave up on using HTML for a few years for exactly this reason; it seemed such a clunky and awkward language.
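
The dependence on tag semantics can be seen in a small case. In the sketch below (placeholder text), the first list cannot be parsed unless the software already knows that li may omit its closing tag and that br never has content; the second can be parsed from its syntax alone.

```html
<!-- Legacy style: the structure is only recoverable by knowing what li and br mean -->
<ul>
  <li>First item<br>continued on a new line
  <li>Second item
</ul>

<!-- Every element explicitly closed: the structure is visible without knowing any tag -->
<ul>
  <li>First item<br />continued on a new line</li>
  <li>Second item</li>
</ul>
```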

New tags and attributes were added for formatting.

In order to accommodate many of the perceived deficiencies in the language, new tags and new attributes were added simply to provide format control. This moved the language even further from its original intent. The resulting code was referred to as tag soup.
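
For readers who never saw it, the following is a rough reconstruction of the kind of markup this produced. The content is invented for illustration, but font, center, bgcolor, and align are real presentational tags and attributes from that era.

```html
<!-- Tag soup: every tag and attribute describes appearance, none describes meaning -->
<body bgcolor="#ffffff">
<center><font size="6" color="red"><b>Opening Hours</b></font></center>
<p align="center"><font face="Arial" size="2">Closed on Sundays
<br><br>
</body>
```

Nothing here tells software that "Opening Hours" is a heading, and several elements are never closed.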

CSS and XHTML Save The Day

By the late 1990s, the whole language was a mess, totally missing its original purpose.

Fortunately, two new languages were introduced to rationalize the situation: XHTML to restore the original purpose of HTML, and CSS to provide the formatting capabilities. CSS will be discussed in a following document.

XHTML requires strict parsing rules. Rather than silently compensating for coding mistakes, browsers that detect syntax errors report them and stop processing the page, thereby almost guaranteeing that the maintainers will use correct syntax. It also means that web crawlers, such as those used by Google to create their index, can make much better sense of what a web page is about.
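
A minimal sketch of the difference, using placeholder text. When processed as XHTML, the first fragment should produce a parsing error rather than a rendered page; the exact wording of the error varies between browsers.

```html
<!-- Not well-formed: the first p and the br are never closed, and the attribute value is unquoted -->
<p class=note>First paragraph<br>
<p>Second paragraph</p>

<!-- Well-formed: accepted by a strict parser -->
<p class="note">First paragraph</p>
<p>Second paragraph</p>
```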

It's sad to say, but too many people learned to write web pages in the 1990s and still haven't changed their mindset, and since then too many new people have learned by following their bad examples.

Even if you do nothing else suggested in this document, you should always name your files with the .xhtml suffix (and, on a web server, make sure they are delivered with the application/xhtml+xml content type), so that browsers will quickly point out any inadvertent mistakes you might make.

XHTML

XHTML isn't perfect yet, but it is so much better than what came before. With a little personal discipline, you can follow conventions that eliminate its remaining problems.

Avoid all formatting tags and attributes.

A few format-related tags and attributes still remain in the language. The tags b, i, u, and small may be used, but only in a context where they have no significant meaning, e.g. to reproduce the original appearance of bold, italic, underlined, or small characters within a quotation.

Use <br /> only between groupings of paragraphs, but even then you should avoid this as much as reasonably possible (e.g. wrap each group in a section tag). And definitely don't use the <p><p> trick that was so popular in the 1990s.
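
One way to follow this advice is sketched below. The paragraph text is placeholder content; any spacing between the groups would then come from a style sheet rather than from the markup.

```html
<!-- Each group of paragraphs is wrapped in its own section instead of being separated by a line break -->
<section>
  <p>First paragraph of the first group.</p>
  <p>Second paragraph of the first group.</p>
</section>
<section>
  <p>First paragraph of the second group.</p>
</section>
```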

There is no need for the <hr> tag. Make use of bottom or top borders of the preceding or following object.

Never enter any attribute related to size, colour, position, font, or style. In general, it's good to avoid using most attributes, except for those that are non-optional, and class= (for CSS). If an attribute isn't related directly to meaning, don't use it.

Use semantic tags.

Some tags exist for semantic purposes only and have no obvious effect on a page. That doesn't mean that you shouldn't use them though. They can be of direct use to you when writing CSS (e.g. suppressing <nav> navigation information when printing paper copies), and of indirect use to you when scanned by web crawlers (e.g. Google's search index).

Doctype, Head, and Body

 <!DOCTYPE html>
 <html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 ...
 </head>
 <body>
 ...
 </body>
 </html>
		
Basic parts of a web page

The data for a web page has two parts, the document type and the actual HTML.

The document type is normally the first line of the file, and exists simply to inform the web browser that the document contains modern HTML, as opposed to older HTML or some completely different language. Though <!DOCTYPE html> looks like an HTML opening tag, it isn't; there must be no matching closing tag, nor self-closing />. Note that the capitalization shown is required: DOCTYPE in upper case and html in lower case.

The HTML section is in real HTML format, and contains two parts, the head and the body:

Head (think search engine optimization)

Many people used to omit the head section, but it really isn't optional. If you care at all about getting a good position in search indexes (e.g. Google), the more useful information you can provide in this section, the better your score will be.

 <head>
   <meta charset="UTF-8" />
   <title>...</title>
   ...
 </head>
				
Required parts of the <head> section.

The charset line tells the browser how the characters in the rest of the document are encoded. Many different encodings are possible, but generally there is no longer any point in using anything other than UTF-8, an encoding of Unicode, which in theory is capable of representing any character in any character set.

The title line is used to tell browsers what text to display at the top of the browser window or tab. In XHTML it's illegal to omit this line, and even in older HTML it was silly to do so, as search engines use this line to build their index. Don't expect to get many search hits if you don't supply a meaningful title.

 <head>
   ...
   <meta name="viewport" content="width=device-width, initial-scale=1" />
   <meta name="description" content="..." />
   <meta name="keywords" content="..., ..." />
   <link type="text/css" rel="stylesheet" href="style.css" />
   <style>
   ...
   </style>
 </head>
				
Other useful parts of the <head> section.

The head section can contain other useful information about the web page.

Include the viewport line if you know that your page will format well in small windows. Searches run on a mobile phone will give a higher score to pages marked this way, and that means more hits for you.

Similarly, the description and keywords lines are intended for search engines, although some major engines (Google among them) have said that they ignore the keywords line when ranking pages. The given title and description are often displayed in the search results, giving you direct control over what people will be told about your page (think click through rate).

Since we have tried to eliminate any suggestion of output formatting in our HTML, the style tag and the stylesheet link tag are extremely important if we care about how our web pages are displayed. This will be covered in detail in the next document.

Body

header
   h1
   p
article
   p
   h2 section
      p
      h3 section
         p
         p
      h3 section
         p
   h2 section
footer
   nav
			
A typical mostly-text web page body.

A web page body for a document that is mostly text (such as the one you are reading now) will typically have a standard structure. All parts are of course optional, but it's good to choose a consistent way of writing your documents, so that a common style sheet can easily be applied to them all.


The real information in the page is contained in the article part. The header and footer are used for human-readable metadata, i.e. they provide information about the article (e.g. abstract, summary, author, copyright). Some browsers might use this to allow the person reading your page to optionally see only the article, suppressing the other two sections.

Similarly, they might suppress any aside divisions, since asides are additional information not essential to the article.
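
The outline above might be written out as follows. This is only a sketch: the heading text and paragraph contents are placeholders, and the aside is added here to show where non-essential material could sit.

```html
<body>
  <header>
    <h1>Document title</h1>
    <p>Abstract or summary of the article.</p>
  </header>
  <article>
    <p>Introductory paragraph.</p>
    <section>
      <h2>First section</h2>
      <p>...</p>
      <section>
        <h3>First subsection</h3>
        <p>...</p>
        <p>...</p>
      </section>
      <section>
        <h3>Second subsection</h3>
        <p>...</p>
      </section>
    </section>
    <section>
      <h2>Second section</h2>
    </section>
    <aside>
      <p>Interesting but non-essential remark.</p>
    </aside>
  </article>
  <footer>
    <nav>...</nav>
  </footer>
</body>
```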


As mentioned already, the use of semantic tags, which don't by default affect the formatted appearance of the document, is highly recommended. Not only will it provide information about what the data represents to search engines (better score for you) and to browsers (better presentation to visually impaired readers), it will make it a lot easier for you to apply style sheets to give your pages a consistent and more interesting appearance.

Accessibility and Portability

Making web pages accessible (e.g. useful for visually impaired people) and portable (e.g. useful on different physical devices or with different brands of browser software) used to be difficult and required very complex solutions. But that was only because people went out of their way to control the appearance themselves (e.g. specifying different font sizes under different conditions).

By following the principle of totally ignoring formatting and appearance when writing HTML, nearly everything that is needed is automatically provided by the web browsers themselves. And most of what is left to do can be accomplished by using seemingly useless semantic tags and attributes (e.g. the scope="col" attribute on a <th> tag).

Other minor details, such as having links too close together on a non-mouse device, can easily be handled in CSS, a topic in the next document.

From the HTML point of view, the only real concern is with the use of images and other visual objects that are inherently not accessible. In most cases these can be handled by appropriate use of alt attributes. Be sure that these alternative descriptions are meaningful. Don't say "mountain" when what the image really means is "Photograph of mountain, showing white snow at top, brown and gray cliffs in the middle, and green forest below the tree-line. The three bands have sharp horizontal separations." Tell exactly what the image is (photograph of mountain), and what details the image is supposed to be illustrating (three distinct layers of terrain).

Not only will this make your page comprehensible to someone that is visually impaired, it will also make it useful for people that don't download the image because of slow or expensive network connections. And (for some people, most importantly), search engines themselves should be considered visually impaired.
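
Applied to the mountain example, the markup might look like this. The file name is hypothetical; the alt text is the description given above.

```html
<!-- Too vague: names the subject but not what the image is for -->
<img src="mountain.jpg" alt="mountain" />

<!-- Meaningful: says what the image is and what it is meant to illustrate -->
<img src="mountain.jpg"
     alt="Photograph of mountain, showing white snow at top, brown and gray cliffs in the middle, and green forest below the tree-line. The three bands have sharp horizontal separations." />
```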

Figures and Asides

The main difference between figures and asides is whether you consider them to contain information essential to the document or to have interesting but non-essential content.

A list of names or a photograph of the members of the City Council could be a figure in an article about recent zoning decisions. In the same article, an anecdote about a councillor's being a descendant of the first mayor could be an aside.

The difference becomes important in such situations as when a browser is told to display only the article, omitting non-essential parts like navigation and asides.
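
The City Council example could be marked up roughly as follows. The file name, caption, and wording are invented for illustration.

```html
<article>
  <p>...</p>

  <!-- Essential to the article: who made the decisions -->
  <figure>
    <img src="council.jpg"
         alt="Photograph of the members of the City Council seated in the council chamber" />
    <figcaption>The members of the City Council who voted on the zoning changes.</figcaption>
  </figure>

  <!-- Interesting, but the article is complete without it -->
  <aside>
    <p>One councillor is a descendant of the city's first mayor.</p>
  </aside>
</article>
```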

Images

Images should almost never appear by themselves, but as part of some containing object.

Visit the
<a href="https://www.w3.org"
    title="Web Standards">
  <img alt=""
    src="https://www.w3.org/favicon.ico" />
WWW Consortium</a>
for specific details.
			
Visit the WWW Consortium for specific details.
Image used as an inline icon.

Occasionally they are used inline with text, typically to provide a visual enhancement, not to provide essential information. For instance, within an anchor, you might want to display an icon related to the URL. In this case you would give an empty value to the alt="" attribute, since you don't want a broken image symbol to be displayed if the image is somehow unavailable.

More typically you would put images within a table element, a figure, or an aside. Give the image a meaningful alt= attribute, and use the <figcaption> tag to provide a reason for having the image.

For inline images, use CSS to limit the height of the images to the same size as the text, and to allow images to fill the full width (or percentage width) of a figure. (This is discussed in the next document.)

Tool Tips

The increasing popularity of touch-screen devices, which lack mouse cursors, has reduced the usefulness of the title="comment" attribute. It is still a good idea to include it though, for cursored devices, for internal documentation, and in the hope that future browsers will provide the equivalent of hover without a mouse cursor.

Be fluent in <abbr title="HyperText Markup Language">HTML</abbr> and <abbr title="Cascading Style Sheets">CSS</abbr>.
			
Be fluent in HTML and CSS.
The words HTML and CSS will display pop-up expansions.

It is especially useful with the <abbr> tag, allowing the user to see the full meaning of an abbreviation.

Whenever someone positions the cursor over an item that has a title= attribute, the browser displays a tool tip, a small pop-up box containing the associated comment. This provides instant help to the user, whether explaining the meaning of a possibly unfamiliar term, giving a reference for a quotation or fact, or describing the effects of selecting a button.

Some browsers indicate the presence of tool tip popups by means such as a dotted underline or a ? cursor. (For consistency across browsers, you can control this behaviour using CSS.)

Headings

HTML provides six headings, h1 through h6.

The most common mistake people make is to use these simply to produce different font types and sizes. That is style, not markup.

h1 HTML Philosophy and Tutorial
h2     Web Design Philosophy
h2     HTML Philosophy
h2     HTML
h3         Different browsers ...
h3         People wanted more control.
h3         People still thought ...
h2     CSS and XHTML Save The Day
h2     XHTML
...
			
Example heading outline.

You should always nest the headings, always using sequential numbering. Typically there will be a single h1 heading near the beginning of the page, several h2s dividing it into sections, possibly some h3s subdividing those sections, and so on.

Web crawlers use these headings to determine the outline of your document. HTML verifiers often display all your headings as a table of contents of your document. If what is displayed isn't a reasonable representation of what your document is about, you're not doing it right.

Lists

Some people have difficulty deciding which type of list is appropriate, often choosing the wrong one or resorting to a table instead. Generally though, you shouldn't have trouble deciding which to use. HTML provides three kinds of lists:

ordered lists <ol>
have numbers placed before each item.
They should be used when listing items that have a required order (e.g. steps in a recipe), or when other text needs to refer to specific items.
unordered lists <ul>
have bullet marks placed before each item.
They should be used when the order of items is not of importance (e.g. ingredients that can be added in any order).
definition lists <dl>
have specified names placed before each item.
They should be used when the order of items is obvious or irrelevant, and each item has a name naturally associated with it (e.g. this list).
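
A short sketch of the three kinds, reusing the recipe idea from the descriptions above. The recipe itself is made up; the definition list borrows two entries from the Terminology section.

```html
<!-- Order matters: steps in a recipe -->
<ol>
  <li>Mix the dry ingredients.</li>
  <li>Stir in the milk.</li>
  <li>Bake until golden.</li>
</ol>

<!-- Order does not matter: ingredients -->
<ul>
  <li>Flour</li>
  <li>Sugar</li>
  <li>Milk</li>
</ul>

<!-- Each item has a name naturally associated with it -->
<dl>
  <dt>data</dt>
  <dd>is the original text of the document being marked up.</dd>
  <dt>metadata</dt>
  <dd>is something that describes the data.</dd>
</dl>
```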

???

...

Pitfalls

Keeping HTML simple and meaningful is mostly a matter of self-discipline. But there are some situations where it is very easy to believe you are thinking in HTML but are actually doing something that should be left to CSS.

Layout

Generally, physical layout should be left to the browser, as configured by CSS. But occasionally, the physical layout actually does contain some of the meaning of the natural data.

The most common instance of this is when data is presented in a table, with a natural association between items in the same column or row. In such cases, use <th> tags to let people know what that association is, and provide the scope="..." attribute with each table heading tag to let software know how to apply the heading.

If you can't easily think of text for the headings or their scope attributes, perhaps a table isn't the appropriate markup to use. Possibly you are simply using the table to force a visual layout, or the headings to force a font change. That is the job of CSS, not of HTML.
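
As a sketch of a table whose layout does carry meaning, consider the following. The data is invented; scope="col" and scope="row" are the standard values that tell software which cells each heading applies to.

```html
<table>
  <tr>
    <th scope="col">Day</th>
    <th scope="col">Opens</th>
    <th scope="col">Closes</th>
  </tr>
  <tr>
    <th scope="row">Monday</th>
    <td>09:00</td>
    <td>17:00</td>
  </tr>
  <tr>
    <th scope="row">Saturday</th>
    <td>10:00</td>
    <td>14:00</td>
  </tr>
</table>
```

Here the heading text came easily ("Day", "Opens", "Closes"), which suggests that a table really is the right markup for this data.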

Classes

The one place where you may take future layout and appearance into consideration is with the use of class="classname" attributes. Use classes to provide semantic information that goes beyond what is provided by HTML itself.

But be aware that it is very easy to abuse classes to control formatting rather than to provide semantic values. Using <span class="warning-message">...</span> is a legitimate use of the class attribute. Using <span class="red-text">...</span> is not. Both may very well end up producing exactly the same effect, but there is a big difference between the two.

If you understand that difference, and believe it to be important, you have passed this lesson.

<noscript>

This is discussed in more detail in the JavaScript document, but generally, scripts should be used only to enhance web pages, not for basic content. Following that philosophy means that there is never an appropriate situation in which the <noscript> tag is needed. All your HTML should always be written as if scripts are not enabled.

???

...

Summary

...