FreeBSD Documentation Project Primer for New Contributors

3.2. Elements, tags, and attributes

All the DTDs written in SGML share certain characteristics. This is hardly surprising, as the philosophy behind SGML will inevitably show through. One of the most obvious manifestations of this philosophy is the notion of content and elements.

Your documentation (whether it is a single web page, or a lengthy book) is considered to consist of content. This content is then divided (and further subdivided) into elements. The purpose of adding markup is to name and identify the boundaries of these elements for further processing.

For example, consider a typical book. At the very top level, the book is itself an element. This ``book'' element obviously contains chapters, which can be considered to be elements in their own right. Each chapter will contain more elements, such as paragraphs, quotations, and footnotes. Each paragraph might contain further elements, identifying content that was direct speech, or the name of a character in the story.

You might like to think of this as ``chunking'' content. At the very top level you have one chunk, the book. Look a little deeper, and you have more chunks, the individual chapters. These are chunked further into paragraphs, footnotes, character names, and so on.

Notice how you can make this differentiation between different elements of the content without resorting to any SGML terms. It really is surprisingly straightforward. You could do this with a highlighter pen and a printout of the book, using different colours to indicate different chunks of content.
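The chunking described above can also be written down as a nested data structure. The following Python sketch is only an illustration: the chunk names are taken from the book example and are not element names from any real DTD, and the (name, children) layout is an assumption made for this sketch.

```python
# Each chunk is a (name, children) pair. The names are illustrative
# only and do not come from any real DTD.
book = ("book", [
    ("chapter", [
        ("paragraph", [("character name", []), ("direct speech", [])]),
        ("footnote", []),
    ]),
    ("chapter", [
        ("paragraph", []),
        ("quotation", []),
    ]),
])


def outline(chunk, depth=0):
    """Return one indented line per chunk, outermost chunk first."""
    name, children = chunk
    lines = ["  " * depth + name]
    for child in children:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(book)))
```

Each level of indentation in the printed outline corresponds to one level of chunking: the book, then its chapters, then whatever each chapter contains.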

Of course, we do not have an electronic highlighter pen, so we need some other way of indicating which element each piece of content belongs to. In languages written in SGML (HTML, DocBook, et al.) this is done by means of tags.

A tag is used to identify where a particular element starts, and where the element ends. The tag is not part of the element itself. Because each DTD is normally written to mark up specific types of information, each one will recognise different elements, and will therefore have different names for the tags.

For an element called element-name the start tag will normally look like <element-name>. The corresponding closing tag for this element is </element-name>.

Example 3-1. Using an element (start and end tags)

HTML has an element for indicating that the content enclosed by the element is a paragraph, called p. This element has both start and end tags.

    <p>This is a paragraph.  It starts with the start tag for
      the 'p' element, and it will end with the end tag for the 'p'
      element.</p>
    
    <p>This is another paragraph.  But this one is much shorter.</p>

Not all elements require an end tag. Some elements have no content. For example, in HTML you can indicate that you want a horizontal line to appear in the document. Obviously, this line has no content, so just the start tag is required for this element.

Example 3-2. Using an element (start tag only)

HTML has an element for indicating a horizontal rule, called hr. This element does not wrap content, so it only has a start tag.

    <p>This is a paragraph.</p>
    
    <hr>
    
    <p>This is another paragraph.  A horizontal rule separates this
      from the previous paragraph.</p>

If it is not obvious by now, elements can contain other elements. In the book example earlier, the book element contained all the chapter elements, which in turn contained all the paragraph elements, and so on.

Example 3-3. Elements within elements; <em>

    <p>This is a simple <em>paragraph</em> where some
      of the <em>words</em> have been <em>emphasised</em>.</p>

The DTD will specify the rules detailing which elements can contain other elements, and exactly what they can contain.
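The nesting in Example 3-3 can be made visible with a short script. This is a sketch only: it uses the html.parser module from the Python standard library, which is a lenient HTML tokenizer rather than a validating SGML parser, and it does not consult a DTD. The class and function names are made up for this illustration, and the depth counting assumes that every element has both a start tag and an end tag.

```python
from html.parser import HTMLParser


class NestingRecorder(HTMLParser):
    """Record each element's name together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, element name) pairs

    def handle_starttag(self, tag, attrs):
        # A start tag opens a new element one level deeper.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # An end tag closes the innermost open element.
        self.depth -= 1


def element_outline(markup):
    """Return the (depth, name) pairs for every element in markup."""
    parser = NestingRecorder()
    parser.feed(markup)
    parser.close()
    return parser.outline


EXAMPLE = ("<p>This is a simple <em>paragraph</em> where some "
           "of the <em>words</em> have been <em>emphasised</em>.</p>")

for depth, name in element_outline(EXAMPLE):
    print("  " * depth + name)
```

For the markup in Example 3-3 this prints p at the outermost level, followed by three em entries indented one level beneath it. Whether that nesting is actually permitted is a question only the DTD can answer, which this sketch does not attempt.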

Important: People often confuse the terms tags and elements, and use the terms as if they were interchangeable. They are not.

An element is a conceptual part of your document. An element has a defined start and end. The tags mark where the element starts and ends.

When this document (or anyone else knowledgeable about SGML) refers to ``the <p> tag'' it means the literal text consisting of the three characters <, p, and >. But the phrase ``the <p> element'' refers to the whole element.

This distinction is very subtle. But keep it in mind.
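One way to keep the two terms apart is to pull a simple element into its pieces. The Python sketch below is an illustration under narrow assumptions: it only handles a single element with a lowercase name, no attributes, and no nested elements, and the function name split_element is made up for this example.

```python
import re


def split_element(markup):
    """Split a simple, non-nested element into its tags and content."""
    match = re.fullmatch(r"(<([a-z][a-z0-9]*)>)(.*)(</\2>)", markup, re.DOTALL)
    if match is None:
        raise ValueError("not a simple element")
    start_tag, name, content, end_tag = match.groups()
    return {
        "name": name,            # the element's name
        "start_tag": start_tag,  # literal text marking where it starts
        "content": content,      # what the tags enclose
        "end_tag": end_tag,      # literal text marking where it ends
    }


parts = split_element("<p>This is a paragraph.</p>")
print(parts["start_tag"], len(parts["start_tag"]))
print(parts["content"])
```

The start tag here is exactly the three characters <, p, and >, while the element is the named unit that the two tags delimit.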

Elements can have attributes. An attribute has a name and a value, and is used for adding extra information to the element. This might be information that indicates how the content should be rendered, or might be something that uniquely identifies that occurrence of the element, or it might be something else.

An element's attributes are written inside the start tag for that element, and take the form attribute-name="attribute-value".

In sufficiently recent versions of HTML, the <p> element has an attribute called align, which suggests an alignment (justification) for the paragraph to the program displaying the HTML.

The align attribute can take one of four defined values, left, center, right and justify. If the attribute is not specified then the default is left.

Example 3-4. Using an element with an attribute

    <p align="left">The inclusion of the align attribute
      on this paragraph was superfluous, since the default is left.</p>
    
    <p align="center">This may appear in the center.</p>

Some attributes will only take specific values, such as left or justify. Others will allow you to enter anything you want. If an attribute value itself needs to contain double quotes ("), use single quotes around the attribute value instead.

Example 3-5. Single quotes around attributes

    <p align='right'>I'm on the right!</p>

Sometimes you do not need to use quotes around attribute values at all. However, the rules for doing this are subtle, and it is far simpler just to always quote your attribute values.
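The quoting rules above are easier to see when the attributes are extracted as name and value pairs. The following Python sketch uses the standard html.parser module, which is a lenient tokenizer and not a validating SGML parser, so it will not complain about values the DTD forbids. The class and function names are made up for this illustration, and the title attribute in the last call is used only to show a value that contains double quotes.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the attributes found in each start tag."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (element name, {attribute name: value})

    def handle_starttag(self, tag, attrs):
        # attrs arrives as (name, value) pairs with the quotes removed.
        self.found.append((tag, dict(attrs)))


def collect_attributes(markup):
    """Return the attributes of every start tag in markup."""
    parser = AttributeCollector()
    parser.feed(markup)
    parser.close()
    return parser.found


print(collect_attributes('<p align="center">This may appear in the center.</p>'))
print(collect_attributes("<p align='right'>I'm on the right!</p>"))
print(collect_attributes("<p title='The \"best\" paragraph'>Some text.</p>"))
```

The quote characters are part of the tag syntax, not part of the value: both the double-quoted and the single-quoted forms yield a plain value, and the single-quoted form lets the value itself contain double quotes.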

3.2.1. For you to do...

In order to run the examples in this document you will need to install some software on your system and ensure that an environment variable is set correctly.

Download and install textproc/docproj from the FreeBSD ports system. This is a meta-port that should download and install all of the programs and supporting files that are used by the Documentation Project.

Add lines to your shell startup files to set SGML_CATALOG_FILES.

Example 3-6. .profile, for sh(1) and bash(1) users

    SGML_ROOT=/usr/local/share/sgml        
    SGML_CATALOG_FILES=${SGML_ROOT}/jade/catalog
    SGML_CATALOG_FILES=${SGML_ROOT}/iso8879/catalog:$SGML_CATALOG_FILES
    SGML_CATALOG_FILES=${SGML_ROOT}/html/catalog:$SGML_CATALOG_FILES
    SGML_CATALOG_FILES=${SGML_ROOT}/docbook/catalog:$SGML_CATALOG_FILES
    export SGML_CATALOG_FILES

Example 3-7. .login, for csh(1) and tcsh(1) users

    setenv SGML_ROOT /usr/local/share/sgml
    setenv SGML_CATALOG_FILES ${SGML_ROOT}/jade/catalog
    setenv SGML_CATALOG_FILES ${SGML_ROOT}/iso8879/catalog:$SGML_CATALOG_FILES
    setenv SGML_CATALOG_FILES ${SGML_ROOT}/html/catalog:$SGML_CATALOG_FILES
    setenv SGML_CATALOG_FILES ${SGML_ROOT}/docbook/catalog:$SGML_CATALOG_FILES

Then either log out, and log back in again, or run those commands from the command line to set the variable values.
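Each assignment in the two examples above prepends one more catalog to the variable, so the final value lists the catalogs in the reverse of the order in which they were added. The Python sketch below reproduces that construction to show the resulting value. It is an illustration only (the function name catalog_path is made up here) and is not a replacement for setting the variable in your shell startup files.

```python
def catalog_path(sgml_root, names):
    """Build a colon-separated catalog list, prepending each new entry."""
    value = ""
    for name in names:
        entry = f"{sgml_root}/{name}/catalog"
        # Mirror the shell lines: new entry first, then the old value.
        value = entry if not value else f"{entry}:{value}"
    return value


print(catalog_path("/usr/local/share/sgml",
                   ["jade", "iso8879", "html", "docbook"]))
```

With the directory names used in the examples, the printed value starts with the docbook catalog and ends with the jade catalog.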

Create example.sgml, and enter the following text:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    
    <html>
      <head>         
        <title>An example HTML file</title>
      </head>
    
      <body>        
        <p>This is a paragraph containing some text.</p>
    
        <p>This paragraph contains some more text.</p>
    
        <p align="right">This paragraph might be right-justified.</p>
      </body>       
    </html>

Try to validate this file using an SGML parser.

Part of textproc/docproj is the nsgmls(1) validating parser. Normally, nsgmls(1) reads in a document marked up according to an SGML DTD and returns a copy of the document's Element Structure Information Set (ESIS, but that is not important right now).

However, when nsgmls(1) is given the -s parameter, nsgmls(1) will suppress its normal output, and just print error messages. This makes it a useful way to check to see if your document is valid or not.

Use nsgmls(1) to check that your document is valid:

    % nsgmls -s example.sgml

As you will see, nsgmls(1) returns without displaying any output. This means that your document validated successfully.

See what happens when required elements are omitted. Try removing the <title> and </title> tags, and re-run the validation.

    % nsgmls -s example.sgml
    nsgmls:example.sgml:5:4:E: character data is not allowed here
    nsgmls:example.sgml:6:8:E: end tag for "HEAD" which is not finished

The error output from nsgmls(1) is organised into colon-separated groups, or columns.

Column	Meaning
1	The name of the program generating the error. This will always be `nsgmls`.
2	The name of the file that contains the error.
3	Line number where the error appears.
4	Column number where the error appears.
5	A one letter code indicating the nature of the message. `I` indicates an informational message, `W` is for warnings, `E` is for errors[a], and `X` is for cross-references. As you can see, these messages are errors.
6	The text of the error message.
Notes: a. The code is not always in the fifth column. For example, `nsgmls -sv` displays `nsgmls:I: SP version "1.3"` (the version shown depends on what is installed), where the code appears in the second column. As you can see, this is an informational message.
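The column layout in the table above lends itself to a small parsing function. The Python sketch below handles only the two layouts shown in this section (the six-column form and the short form from the footnote). Other layouts may exist and are not handled, the function name parse_nsgmls_message is made up for this illustration, and a file name containing a colon would confuse it.

```python
def parse_nsgmls_message(line):
    """Split one nsgmls message line into named fields.

    Assumes one of the two layouts shown in this section.
    """
    line = line.rstrip("\n")
    # Split at most five times so colons in the message text survive.
    fields = line.split(":", 5)
    if len(fields) == 6 and fields[2].isdigit() and fields[3].isdigit():
        program, filename, lineno, column, code, text = fields
        return {
            "program": program,
            "file": filename,
            "line": int(lineno),
            "column": int(column),
            "code": code,
            "text": text.strip(),
        }
    # Short form, as in the footnote: program, code, text.
    program, code, text = line.split(":", 2)
    return {
        "program": program,
        "file": None,
        "line": None,
        "column": None,
        "code": code,
        "text": text.strip(),
    }


message = parse_nsgmls_message(
    "nsgmls:example.sgml:5:4:E: character data is not allowed here")
print(message["file"], message["line"], message["column"], message["code"])
```

A line that matches neither layout raises ValueError when the short form is unpacked, which is a reasonable signal that the assumptions of this sketch do not hold for that message.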

Simply omitting the <title> tags has generated two different errors.

The first error indicates that content (in this case, characters, rather than the start tag for an element) has occurred where the SGML parser was expecting something else. In this case, the parser was expecting to see one of the start tags for elements that are valid inside <head> (such as <title>).

The second error is because <head> elements must contain a <title> element. Because this one does not, nsgmls(1) considers that the element has not been properly finished. However, the closing tag indicates that the element has been closed before it has been finished.

Put the title element back in.