Contents

Part 1  Introduction: EBNF Rules for XML

Chapter 1  Statement of the Problem
    Section 1.1  Introduction
    Section 1.2  Overview of XML-Examples
    Section 1.3  XML Applications-Our Goals
    Section 1.4  A Strategy for Writing Code
    Section 1.5  A Program's Shell

Chapter 2  A Quick and Dirty Parser: Version 1
    Section 2.1  Introduction
    Section 2.2  Version 1's Goals and Tasks
    Section 2.3  Data Structures-The Heading
    Section 2.4  Initialization and File Input
    Section 2.5  Parsing
    Section 2.6  The Output Phase-Writing Reports
    Section 2.7  Summary

Part 1   Introduction: EBNF Rules for XML

Chapter 1

Statement of the Problem


Section 1.1 Introduction

Dear Reader: We (you and I) are about to undertake a long journey. Before we begin, I should answer two questions:

1. What do I mean by dual publishing? and
2. What does it have to do with XML (extensible markup language)?

It was not too many years ago when desktop publishing was the rage. Then, everyone's goal was to transform material so that it was typeset, beautifully, on paper. When the internet began to be widely used, people wrote filters that allowed typeset material to be viewed with a browser on the screen. Other people wrote filters that changed material in the other direction. Such filters serve a need, but in my experience they rarely prepare material that is optimal for both paper and screen. In effect, using the filters is an example of allowing tools to dictate what you can create instead of creating new tools to aid in the creation of what you can visualize.

Figure 1.1 Alternate approaches to dual publishing

However, as option 3 in Figure 1.1 suggests, there is another way. You begin by typesetting your material (beautifully, of course). Then, you rethink your material and prepare it with an appropriate use of hyperlinks for the screen. This two-step process results in material that is customized for both paper and screen. It should be better than what you could create using the filter approach. But, if you stop here, you've violated one of the cardinal rules of programming and data creation. Never store material that can change in two different places, because when the material does change, as it sooner or later will, you will invariably forget to make the change in one of the places. Then, chaos will rule.

So, if you decide to play the two-step game, you will need to do the following. First, look at the paper and screen representations of your material and think about how you built each one. Second, abstract out the key parts of each representation. Third, use that understanding to create a new form of your material. When I do this, I end up viewing my material as a number of chunks of data, which I separate with tags, instead of as a collection of words, sentences, and paragraphs. Finally, write custom software that pushes the data back to your original visions on paper and screen. I can hear some of you saying, "That sounds like a lot of work," and if you have only a page or two of material, it is. But, I've found the extra work can simplify longer projects.

With the above discussion I can answer the questions I asked at the start of this section. In this book dual publishing means using the two-step approach to preparing material for paper and screen. I completed two projects, typesetting an academic journal and preparing a reference manual for TeX, using this approach before I heard about XML. In each project my tags were a mixture of HTML and troff. After I learned about XML, I realized I could easily map my tags into pure XML ones. In other words, XML is a natural way to prepare tags that produce chunks of data.

Before we look at XML in more detail, I want to make one additional point. As Figure 1.2 suggests, the push of your data to paper actually has two components. Your custom software is the first, and it prepares an intermediate file that will hold TeX, troff, rtf, or some other format. The second feeds that file and a formatting file, which you must also prepare, to the underlying program: TeX, groff, or whatever. That program does the actual typesetting. This turns out to have an advantage that only becomes apparent after you've played this game a few times. Namely, you can create several format files and then easily push your data into a number of different formats on paper.

Figure 1.2 The two components of the push from data to paper

Section 1.2 Overview of XML-Examples

This section contains three examples that illustrate the ideas from Section 1.1. None of the examples is contrived; each is from the real world. Collectively, the examples will allow me to explain why XML is powerful and useful. The first is from an early draft of this book. The second involves grammars-in particular, the grammar for XML-specified using EBNF notation (Extended Backus-Naur Form). The third shows a page from my book, TeX Reference Manual.

Example One

I prepared this book using XML. Looking at a small part of that material is a good place to begin our study of XML and dual-publishing. Figure 1.3 shows lines from an early draft of page 1. The chunks of data I referred to in the previous section are readily apparent in these lines. In particular, the lines contain seven tags: newDiv, title, subtitle, p, parStyle, figShow, and figRef. Two of the tags are empty-element tags: figShow and figRef. The others have both a start tag and an end tag. Four of the tags have attributes: id for figShow and figRef, type for newDiv, and bulstyle and bskip for parStyle. Clearly, the type attribute is optional, and it is reasonable to guess that its default value is "Section". Also, it is reasonable to guess that id is a required attribute for both its tags. Both guesses are correct. The parStyle tag is much more complicated. It has ten attributes, and each one is optional. This tag allows me to make ordered and unordered lists and other custom paragraphs.

-----------------------------------------------------------------------
Introduction
EBNF Rules for XML
Statement of the Problem
Introduction

[bsect]Dear Reader: We (you and I) are about to undertake a long journey. Before we begin, I should answer two questions:

What do I mean by [it]dual publishing[/it]? and

What does [it]it[/it] have to do with XML (extensible markup language)?

. . .

However, as option 3 in suggests, there is another way. You begin by typesetting your material (beautifully, of course). Then, you ...

----------------------------------------------------------------------- Figure 1.3 XML form of an early draft of page 1 of this book

The lines raise several questions. The first is: Why did I make a single tag newDiv and use an attribute to distinguish between Parts, Chapters, and Sections instead of making three tags: newPart, newChapter, and newSection? This is a basic question about using XML. It is important to emphasize that there is not a single correct way to make tags. I could have made three, but each new tag requires overhead (i.e., programming support and other hand-holding). I couldn't see doing the extra work for newPart since I planned to use it at most three or four places in the book. Once I decided to use an attribute for Part, it made sense to do the same for Chapter. In short, preparing tags is more of an art than a science.

A second question is: Are there typos in the first two paragraphs? In particular, what are [bsect] and [it]dual publishing[/it]? Are they tags? The answer is, "No, they are not tags, but they could be." I didn't make them tags because that would have required more work. If I had written them: &bsect; and &it;dual publishing&it1;, you would guess that they were entity references. And they almost are, but it turns out entity references are not useful when you use XML with dual-publishing because they are not general enough. The problem is this. When I send the first paragraph to paper, I need [bsect] to become the TeX control sequence \beginSect. But when I send it to the screen, I need the item to go away. Similarly, [it]dual publishing[/it] becomes either {\it dual publishing} or <i>dual publishing</i>. Both these substitutions are simple sed-like changes and don't require the power that tags provide. This is a subtle point, and it should become clear when we look at the more general problem of character translation later in this Part and in Part 2.
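Both substitutions really are sed-like. Here is a short Python sketch of the idea (the substitution tables are mine, built from the translations just described; the book's own character-translation code comes later):

```python
# Sed-like, context-free substitutions for the bracket items described
# above. Each output medium gets its own substitution table.
TO_PAPER = [("[bsect]", r"\beginSect "), ("[it]", r"{\it "), ("[/it]", "}")]
TO_SCREEN = [("[bsect]", ""), ("[it]", "<i>"), ("[/it]", "</i>")]

def push(text, table):
    """Apply one medium's substitutions to a chunk of text."""
    for item, replacement in table:
        text = text.replace(item, replacement)
    return text

para = "[bsect]What do I mean by [it]dual publishing[/it]?"
print(push(para, TO_PAPER))   # \beginSect What do I mean by {\it dual publishing}?
print(push(para, TO_SCREEN))  # What do I mean by <i>dual publishing</i>?
```

The point of the sketch is that the same source paragraph pushes to either medium with a table swap; no parser is needed for these items.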

I want to raise a final question that may seem strange, but it introduces the next example. Where do tag, start tag, end tag, empty-element tag, attribute, and the other words used with XML come from?

Example Two

The syntax of XML is specified by a grammar with over 80 production rules. Those rules are discussed in the W3C document entitled "Extensible Markup Language (XML) 1.0 (Second Edition)." That document is available on the internet in several formats: pdf, html, ps, and xml. One of the first things I did, after I downloaded an XML version of the document, was write a sed script that pulled the productions out of W3C's file, stripped tags-meaningful to W3C but extraneous to XML's underlying grammar-out of the productions, and wrote the results to a new file. Figure 1.4 shows lines from that file. Three chunks of data are readily visible: left sides, right sides, and productions.

-------------------------------------------
document    prolog element Misc*
element     EmptyElemTag | STag content ETag
content     CharData? ((element | CDSect | PI | Comment) CharData?)*
STag        '<' Name (S Attribute)* S? '>'
-------------------------------------------
Figure 1.4 XML form of EBNF production rules for a subset of XML

Clearly, this new file holds everything needed to display XML's production rules using EBNF's more familiar notation. What may not be clear, at least until you've worked with this file and begun to think in dual-publishing mode, is that the file also holds a great deal of meta-information. The next section shows how it may be used.
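One piece of that meta-information is the "Used by" cross-reference that appears with each production in the next section. A few lines of Python can compute it from the (left side, right side) pairs (an illustrative sketch with my own names; the book's Q&D processor is developed in the chapters that follow):

```python
import re

# (left side, right side) pairs pulled from the productions file,
# as in Figure 1.4.
PRODS = [
    ("document", "prolog element Misc*"),
    ("element",  "EmptyElemTag | STag content ETag"),
    ("content",  "CharData? ((element | CDSect | PI | Comment) CharData?)*"),
    ("STag",     "'<' Name (S Attribute)* S? '>'"),
]

def used_by(prods):
    """Map each nonterminal to the productions whose right sides use it."""
    lhs_names = {lhs for lhs, _ in prods}
    table = {lhs: [] for lhs, _ in prods}
    for lhs, rhs in prods:
        # Words in a right side that match a left side are nonterminals;
        # quoted literals and EBNF operators fall out of the word scan.
        for word in re.findall(r"[A-Za-z]\w*", rhs):
            if word in lhs_names and lhs not in table[word]:
                table[word].append(lhs)
    return table

print(used_by(PRODS)["element"])  # ['document', 'content']
```

That one dictionary is enough to generate the "Used by" lines for paper and the backward links for the screen.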

Example Three

I had a difficult time learning TeX. After I completed my first project that used it, I decided to write a reference manual for its primitives. I had not heard of XML at the time, but I decided to build an ASCII database that would hold the bulk of the manual. TeX has 325 primitives, and I made one record in my database for each primitive. Also, I set up fields, as necessary, to hold the chunks of information I used to document each primitive. Figure 1.5 shows an XML version of my record for \vrule. It took me less than a minute to convert this material by hand from my format into XML.

-------------------------------------------------------------------------------
64, 86, 151, 221--222, 224, 245--247, 281--282, 283, 357, 392, 420
221--224
[height[lt0]dimen[gt0] depth[lt0]dimen[gt0] width[lt0]dimen[gt0]]
makes a rule box in horizontal mode
A rule box is a solid black rectangular box with a height, depth, and width.
Such a box may look like a horizontal or vertical line. The [tt1]\vrule[/tt1]
command makes a rule box, and it is usually used in horizontal mode. If none
of the box dimensions are specified, the box has width 0.4pt, and its height
and depth extend to the smallest box that encloses it. No glue is placed to
the left or right of a [tt1]\vrule[/tt1] [tr]221[/tr]. If [tt1]\vrule[/tt1]
is used in vertical mode, a new paragraph is started, and the rule box is
typeset [tr]222[/tr]. However, a [tt1]\vrule[/tt1] may be used with
[tt1]\leaders[/tt1] in vertical mode [tr]224--225[/tr].
hrule, leaders
Here is a square vrule: \vrule height 0.23in depth 0.02in width 0.25in\par
\def\tvr#1#2#3{#1. Here is another type of vrule #2\ in a line.#3\par}
\tvr A{\vrule width 1in}{}
\tvr B{\vrule height 0.4pt width 1in}{}
\tvr C{\vrule height 3.4pt depth -3pt width 1in}{}
\tvr D{\vrule}{}% Note: Caslon's D is ~0.74pt wider than its E (at 10pt).
\tvr E{\vrule}{\vrule height 8.5pt depth 3.5pt width 0pt} % Use a strut.
Line 1 typesets a square with quarter-inch sides. Line 2 defines a macro
which is used on lines 3--7 to typeset several types of vrules. Lines 3 and 4
show how the height and depth of a vrule expand if neither is specified. Line
5 shows how a negative depth in a vrule works. Line 6 shows that if no
dimensions are specified for a vrule, [tex0] makes a vertical rule. Finally,
the third parameter on line 7 acts as a strut which increases the height and
depth of the line. That explains why the rule made by the second parameter on
line 7 is larger than the rule made by line 6.
------------------------------------------------------------------------------- Figure 1.5 XML form of TRM's reference page for \vrule

At first glance the lines in Figure 1.5 may appear intimidating, but if you look closely at them, you should see they combine aspects from the first two examples. In this example primitive plays the same role that prod did in the previous example. Both are records. But, several of primitive's fields contain paragraphs of text. So, this example is also similar to the first one. Several of the tags in this example use attributes. It's not apparent from \vrule's record, but some of the attributes are required, and others are optional. In fact, some of the tags themselves are optional. Also, documenting \vrule does not require all the tags that are available. Finally, the tags in a primitive can occur in any order. It was easy to keep the two fields in each prod record in order, but if you begin to add fields to the records that make up this project, you will appreciate the luxury of not having to keep the fields in a particular order.

Summary

Now that we've looked at several examples, I want to summarize why XML is both powerful and useful:

• XML does everything in ASCII. So, you can not only use your favorite text editor to create and maintain the XML files in a project, but you can also use all the familiar tools (e.g., sed, grep, wc, sort, ...) that work with ASCII data. Also, using ASCII means that anyone who has access to your data can look at it and see what you've done. HTML has that feature, and it is one reason HTML became popular so quickly.
• XML allows you to combine variable length records with marked-up text. You can easily sort the records or search for a record that you want to use at a particular place in a document. This feature gives XML documents tremendous power.
• XML files are fast to process. After I make a change to a 50-page document in my editor, it takes me less than five seconds to convert my XML to PostScript and to move to Ghostview where I can see the change in a typeset document.
• XML files are relatively easy to parse.

The last reason is important, because your XML files are just a collection of bits and bytes on your hard disk until you write an application that parses the tags in your files. Applications lead us directly to the next section.
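To make that last point concrete before we move on, a single regular expression is enough to separate a well-formed snippet into tag tokens and character-data tokens (a sketch only; the parser we build in the next chapter does much more):

```python
import re

# Split a snippet of XML into tag tokens and character-data tokens.
# A tag starts with '<' and runs to the next '>'; everything between
# tags is character data.
TAG = re.compile(r"<[^>]*>")

def tokens(xml):
    out, pos = [], 0
    for m in TAG.finditer(xml):
        if m.start() > pos:
            out.append(xml[pos:m.start()])  # character data
        out.append(m.group())               # a tag
        pos = m.end()
    if pos < len(xml):
        out.append(xml[pos:])
    return out

print(tokens("<p>A <i>small</i> test.</p>"))
# ['<p>', 'A ', '<i>', 'small', '</i>', ' test.', '</p>']
```

Most of a real parser's work is in what you do with the tokens, not in producing them, which is why applications lead directly to the next section.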

Section 1.3 XML Applications-Our Goals

I want to emphasize the last point in the previous section. By itself, your XML data is just so much space on your hard disk until you write an application that acts as the push in option 3 of Figure 1.1. The same thing is true for the three examples I gave in the last section. Those examples live only on my hard disk. So, I'm responsible for providing their applications.

In this section we will first look at the paper and screen versions of those examples. Then, I will state our goals for the rest of this book.

Example One

Figure 1.6 shows the paper version of the material in Figure 1.3. This chapter's first screen or two shows its screen version. If you trace Figure 1.1 back to its source, you'll discover it began life as the file, pic03.pic. I'll show in Part 3 how easy it is for an XML dual-publishing application to turn a .pic file into a .tex file, when it wants to display the file on paper, or into a .gif file, when it wants to display it on the screen.

Figure 1.6 Paper form of the beginning of Chapter 1 of this book

Example Two

Figures 1.7 and 1.8 show the screen and paper versions of the material in Figure 1.4. Each version answers two questions that can be frustratingly difficult to answer using normal EBNF for a grammar with many productions. Namely:

1. given a nonterminal that appears on the right side of a production (e.g., CharData in production 3 for content) where is the production for the nonterminal? and
2. given a production that defines a nonterminal (e.g., production 9 for Name) what other productions use the nonterminal?

The second question is easy to answer since both versions contain Used by lines, and one follows every nonterminal's production when the nonterminal is used by other productions. Also, each version allows a reader to move quickly from a nonterminal (e.g., CharData on production 3 or PITarget in the Used by area of production 9) to the production for the nonterminal. The screen version uses links. You can click on a nonterminal and move to the appropriate production. The paper version typesets each nonterminal's production-number as a superscript. The ease with which the versions answer the above questions is an example of the meta-information that I said Figure 1.4 contains.

I hope you will stop for a few moments and think about this example. It is an excellent illustration of the ideas from Section 1.1. In particular, ask yourself how you would transform the data in Figure 1.4 into the screen and paper representations shown here.

Subset of XML Productions
1. document ::= prolog element Misc*

2. element ::= EmptyElemTag | STag content ETag

Used by: document, content

3. content ::= CharData? ((element | CDSect | PI | Comment) CharData?)*

Used by: element

4. STag ::= '<' Name (S Attribute)* S? '>'

Used by: element

5. ETag ::= '</' Name S? '>'

Used by: element

6. Attribute ::= Name Eq AttValue

Used by: STag, EmptyElemTag

7. Eq ::= S? '=' S?

Used by: Attribute, VersionInfo, SDDecl, EncodingDecl

8. AttValue ::= '"' [^<"]* '"' | "'" [^<']* "'"

Used by: Attribute

9. Name ::= (Letter | '_' | ':') (NameChar)*

Used by: STag, ETag, Attribute, PITarget, EmptyElemTag

10. NameChar ::= Letter | Digit | Extender | '.' | '-' | '_' | ':'

Used by: Name

11. Letter ::= [#x41-#x5A] | [#x61-#x7A] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#xFF]

Used by: Name, NameChar

12. Digit ::= [#x30-#x39]

Used by: NameChar

13. Extender ::= #xB7

Used by: NameChar

14. Char ::= [#x00-#xFF]

Used by: Comment, CData, PI

15. CharData ::= [^<]* - ([^<]* ']]>' [^<]*)

Used by: content

16. S ::= (#x20 | #x9 | #xD | #xA)+

Used by: STag, ETag, Eq, PI, Misc, EmptyElemTag, XMLDecl, VersionInfo, SDDecl, EncodingDecl

17. Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

Used by: content, Misc

18. CDSect ::= CDStart CData CDEnd

Used by: content

19. CDStart ::= '<![CDATA['

Used by: CDSect

20. CData ::= (Char* - (Char* ']]>' Char*))

Used by: CDSect

21. CDEnd ::= ']]>'

Used by: CDSect

22. PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'

Used by: content, Misc

23. PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

Used by: PI

24. Misc ::= Comment | PI | S

Used by: document, prolog

25. EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'

Used by: element

26. prolog ::= XMLDecl? Misc*

Used by: document

27. XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'

Used by: prolog

28. VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')

Used by: XMLDecl

29. VersionNum ::= ([a-zA-Z0-9_.:] | '-')+

Used by: VersionInfo

30. SDDecl ::= S 'standalone' Eq (("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"'))

Used by: XMLDecl

31. EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )

Used by: XMLDecl

32. EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*

Used by: EncodingDecl
Figure 1.7 Screen form of EBNF production rules for a subset of XML

Figure 1.8 Typeset form of EBNF production rules for a subset of XML

Example Three

Figures 1.9 and 1.10 show the screen and paper versions of \vrule's reference page. The paper version should be self-explanatory, but the following comments may help you visualize how the screen version works.

vrule Box Command
Synopsis: \vrule[height<dimen> depth<dimen> width<dimen>]

Description:

A rule box is a solid black rectangular box with a height, depth, and width. Such a box may look like a horizontal or vertical line. The \vrule command makes a rule box, and it is usually used in horizontal mode. If none of the box dimensions are specified, the box has width 0.4pt, and its height and depth extend to the smallest box that encloses it. No glue is placed to the left or right of a \vrule [221]. If \vrule is used in vertical mode, a new paragraph is started, and the rule box is typeset [222]. However, a \vrule may be used with \leaders in vertical mode [224-225].
Example:

  1. Here is a square vrule: \vrule height 0.23in depth 0.02in width 0.25in\par
2. \def\tvr#1#2#3{#1. Here is another type of vrule #2\ in a line.#3\par}
3. \tvr A{\vrule width 1in}{}
4. \tvr B{\vrule height 0.4pt width 1in}{}
5. \tvr C{\vrule height 3.4pt depth -3pt width 1in}{}
6. \tvr D{\vrule}{}% Note: Caslon's D is ~0.74pt wider than its E (at 10pt).
7. \tvr E{\vrule}{\vrule height 8.5pt depth 3.5pt width 0pt} % Use a strut.
Produces: See typeset version.

• Line 1 typesets a square with quarter-inch sides.
• Line 2 defines a macro which is used on lines 3-7 to typeset several types of vrules.
• Lines 3 and 4 show how the height and depth of a vrule expand if neither is specified.
• Line 5 shows how a negative depth in a vrule works.
• Line 6 shows that if no dimensions are specified for a vrule, TeX makes a vertical rule.
• Finally, the third parameter on line 7 acts as a strut which increases the height and depth of the line. That explains why the rule made by the second parameter on line 7 is larger than the rule made by line 6.
TeXbook References: 221-224. Also: 64, 86, 151, 221-222, 224, 245-247, 281-282, 283, 357, 392, 420.

For Related Examples, see: crcr, everycr, halign, vadjust.

Figure 1.9 Screen form of TRM's reference page for \vrule

The file trm.html is nearly 900K. It consists of three separate lists of primitives (remember TeX has 325 primitives): a summary list in alphabetical order, a summary list in family order, and the reference pages, which are also in alphabetical order.

Each of the summaries is organized as a collection of tables whose rows hold a primitive and its brief description. If you click on the primitive, you are moved to its reference page. Once you're on that page (i.e., something similar to Figure 1.9) you can click on the primitive's name (vrule in the upper-left corner of Figure 1.9), and you are moved back to the primitive's row in the alphabetical summary area. Or, you can click on a primitive's family (Box in the upper-right corner of Figure 1.9), and you are moved to the primitive's row in the family summary area. Both summary areas have links spread through them. So, you can quickly move around in each area or move to the other summary area. In short, trm.html contains enough links so that you can fly around in the 900K file, quickly and easily.

The above description suggests that Figure 1.5 also contains meta-information, and that it can be incorporated into a program.

Figure 1.10 Typeset form of TRM's reference page for \vrule

Our Goals

Each of the applications that I used to prepare the examples described in this section has three phases. Figure 1.11 shows them and how they interact.

Figure 1.11 The 3 phases of the applications described in this section

Phase 1 loads a project's primary file into a buffer in memory and passes the buffer to phase 2, the parsing phase. Periodically, the parsing phase interrupts itself and sends a stream of tokens back to phase 1. That is where the real work of the application takes place. At this stage two things can happen. Phase 1 can complete its work on the tokens and send items to phase 3 where they are translated, as necessary, and written to either a .tex or an .html file. Or, phase 1 may discover it needs to load another file. So, it does. Then, it sends that new buffer to phase 2. This sets up a recursive situation. Eventually, the parsing phase completes its work on the original buffer, and processing ceases.
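The flow just described can be sketched in a few lines of Python. All the names here-parse, translate, the load tag, the FILES table standing in for the disk-are invented for illustration; the real front end is written in Part 2:

```python
import re

# Three-phase skeleton: phase 1 drives the work, phase 2 parses a buffer
# and hands back tokens, phase 3 writes translated output. Loading a
# nested file re-enters the parser, which makes the flow recursive.
def parse(buffer):                 # phase 2: yield a stream of tokens
    yield from re.findall(r"<[^>]*>|[^<]+", buffer)

def translate(token, out):         # phase 3: translate and write items
    out.append(token)

FILES = {"main.xml": "<p>See <load name='note.xml'/> here</p>",
         "note.xml": "<i>a note</i>"}

def process(name, out):            # phase 1: the real work happens here
    for token in parse(FILES[name]):
        m = re.match(r"<load name='([^']*)'/>", token)
        if m:
            process(m.group(1), out)   # recurse into the nested file
        else:
            translate(token, out)
    return out

print("".join(process("main.xml", [])))
# <p>See <i>a note</i> here</p>
```

Notice that the outer buffer's parse simply resumes once the nested file finishes; that is the recursive situation described above.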

Now, it turns out that with two exceptions phases 2 and 3 are the same for the applications we are considering in this section. Those exceptions are:

1. Each application has its own collection of tags. Even if two applications share several tags, they may handle them differently.
2. Some of the characters and other things that need to be translated are application specific.

Fortunately, it is easy to prepare files holding everything the parsing and character translation phases need to perform their tasks. That makes it possible to write code, a front end, that performs those tasks and that many applications can share. Having a front end that is debugged and ready to go will greatly simplify writing the type of application I'm describing. Such an application will work just as I described above except that, before it passes the first buffer to the parsing phase, it will load its tag-information file and send it to the parser's initialization function. Then it will load the file the character-translation phase needs to initialize itself. Once both tasks are successfully completed, the application will continue as I described above. I call this common code a front end because applications can share the code in the same way C compilers, written for different architectures, can share code (e.g., lexical analysis and parsing, string and memory management, and error reporting).

I can now state our goals. We want to:

1. develop a front end similar to what I've just described.
2. use the front end with a real application.
3. see first hand how complicated software evolves.

Unfortunately, unless you write compilers for a living, which I hasten to add I don't, I am afraid I would quickly lose many readers if I jumped right in to the front end. So, we will do what I did originally and begin by looking at the code for a Q&D (quick and dirty) processor that takes the productions from Figure 1.4 and prepares material similar to that shown in Figures 1.7 and 1.8.

The remaining chapters in Part 1 show how my Q&D processor evolved.

Then, Part 2 builds on what we've learned and writes a full-blown front end for dual-publishing applications.

Finally, Part 3 develops the application I used to prepare this book.

However, before we begin the Q&D, I want to step back and make some observations about writing code.

Section 1.4 A Strategy for Writing Code

Every writer eventually develops a style. I always put braces in the same places in my programs. I use the same indentation. I use similar variable names in all my programs. I write the same type of comments. I've learned that doing things automatically, without thinking, makes me a better writer. In fact, I'm working best when I'm sitting and typing and just barely conscious of what I'm typing. That may seem paradoxical to you, but I'm actually concentrating so completely on what I need to do now, at this very instant, that I'm not aware of anything else. If you are still developing your style, you can look at my code and pick and choose aspects from it. But, unfortunately, I don't know how to teach you to work the way I do.

Writing also has a second aspect. Some days, things do not go well! If you've never sat at your keyboard and wondered how in the world you're going to write the code for a particular problem, then you may skip this section. But, if you are like me and find yourself, at times, stumped or, what is even worse, seriously blocked and unable to do anything, then you may find this section helpful.

Three Kinds of Problems

Years ago, when I was a graduate student, one of my professors was in a reflective mood and told those of us sitting in his class that day that there were three kinds of problems:

1. baby problems. These were so easy that, if you considered one, you immediately knew how to solve it and could write out a solution quickly. He added that these were not worth spending time on.
2. impossible problems. These were so hard that you knew you didn't know how to solve them or even begin them. At that time Fermat's Last Theorem was unsolved, and it was, for most mathematicians, an impossible problem-one you could spend years working on and end up with nothing to show for your work.
3. everything else. These were, my professor emphasized, fair game. They, particularly the more difficult ones in this set, were the ones you should try to solve.

Now, the two bars that break problems into the above three sets are different for each person. Also, more importantly, the bars move over time for each person. If you think back to the problems you thought difficult one and five years ago, you will see what I mean.

In summary: if the problem you want to solve or the code you are trying to write is in your third set, then my comments may help you develop a strategy for solving it.

Where do you want to go?

When I begin a project that is one of my everything-else problems, I like to think of it as a journey. Using the analogy between solving a problem and undertaking a journey, I can rephrase the ideas in the previous subsection. When you set out on a journey,

• you must know where you want to go, but
• you do not have to know how to get there.

I cannot over emphasize the previous two points. Keeping the second one firmly in mind is critical in avoiding writer's block.

Writer's Block

Getting stuck, or blocked, can occur at two places on a journey: at the beginning or after we get started if we run into a particularly nasty obstacle. Getting stuck would not be too bad, except that almost as soon as we are, the voice starts. I'm speaking of one of the many voices each of us carries. This one loves to say, "You can't do this," or "Why not quit? You're a failure anyway." We rarely acknowledge that such voices exist, so I watched, with interest bordering on fascination, the parts of the second Lord of the Rings movie where Gollum wrestled and fought with one of his voices when it urged him to betray Frodo and steal back the ring. Those scenes, for me, were the most realistic part of the fantasy movie, because when I get blocked, and my voice starts, I fight it just as Gollum did. Of course, I can't just fight it because the voice feeds on attention. The rest of this section looks at other things I do.

Jump and Hack

One of the most difficult parts of going on a journey is to leave, to take the first step on the road from which you may not return. When I begin a new project, the first thing I do is create a directory (or subdirectory) for the files that I will create as the project grows. Then, I decide on the name for the first program I will write. When I am faced with those simple decisions, I can easily get sidetracked into focusing on the hundreds upon hundreds of other names the project will require. I learned to program when names of directories, files, functions, and variables were limited to 6 or 8 characters, and I still use short abbreviations for most of the names I need. That is part of my style. Since I also keep most of them in my head, one of my constant fears is that I'll select a poor first name and run into naming conflicts later. There was a time when I would get blocked and spend several hours, or even a morning, toying with the name for a project's first directory and program.

Now, I spend a few minutes, and if I feel myself getting stuck, then before the voice can start, I'll say, "David, JUMP-make up a name-it doesn't matter what it is-you can change it later-just do it-let's go." That is usually enough to get me off and running.

My immediate goal-and now I'm working as quickly as I can-is to take an old program that is similar to what I'm trying to do, hack it to pieces-until it is just a shell of itself-take the Makefile I used with the old program, change it so it works with my new shell, and then tweak things until my new shell compiles-the next section gives an example of this process. Later, when I'm in the middle of the program, I'll continue to hack code from another program or even from another place in this one if that will help me make progress on my current task.

As soon as I have a shell that compiles, I know that I can add features to it and that those features will lead me towards the project's goal.

Solve a Simpler Problem

When you're on a journey, there are times when you can take a detour and still arrive at your destination sooner than if you had gone straight. Solving a simpler problem may initially seem like extra work, but it may help you solve your real problem more quickly than you otherwise would.

For example, I've said that our goal for the remaining chapters in Part 1 is to write a program that turns productions, like the ones in Figure 1.4, into the forms shown in Figures 1.7 and 1.8. As soon as that program's shell compiled, I realized this problem was too big for me to do in one step. So, I thought about an easier problem. Eventually, I realized that if I simplified the parsing, I could write and debug other aspects of the program. When they worked, I could come back and write a proper parsing routine. I decided my first program should work with any ASCII file-not just ones holding productions-and should break it into whitespace and everything else. With that decision my program needed to:

• read a file into memory,
• parse the file and break it into the above tokens,
• store the tokens in memory until the parsing was complete, and then
• use the tokens to recreate the file and write it back to my disk.

Clearly, if I can't do what I've just described, there is no way that I can write the real productions program. But, if I do the above, I only have to change the parsing and recreate routines to make the productions program work.
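The parse and recreate steps above can be sketched in a few lines of C. This is a minimal sketch, not the book's code: the names (tok, tokenize, recreate) and the fixed-size token array are my own assumptions, and the real program stores its tokens differently.

```c
#include <ctype.h>
#include <string.h>

struct tok { const char *p; int n; };   /* pointer + length, no copying */

/* Break buf[0..nbuf) into alternating runs of whitespace and
   everything else; store at most maxt tokens; return the count. */
static int tokenize(const char *buf, int nbuf, struct tok *tv, int maxt)
{
    int i = 0, j, nt = 0;

    while (i < nbuf && nt < maxt) {
        int ws = (isspace((unsigned char)buf[i]) != 0);

        for (j = i;
             j < nbuf && (isspace((unsigned char)buf[j]) != 0) == ws;
             j++)
            ;
        tv[nt].p = buf + i;
        tv[nt].n = j - i;
        nt++;
        i = j;
    }
    return nt;
}

/* Use the tokens to recreate the original bytes; return bytes written. */
static int recreate(const struct tok *tv, int nt, char *out)
{
    int i, n = 0;

    for (i = 0; i < nt; i++) {
        memcpy(out + n, tv[i].p, tv[i].n);
        n += tv[i].n;
    }
    return n;
}
```

Because recreate must reproduce the input byte for byte, comparing the new file with the original (via cmp) is a complete test of both routines.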

Divide and Conquer-to-do Lists

Figure 1.12 shows the beginning and ending points of a journey or project. Faced with those points, you should ask yourself, "Can I move directly between the points?" If you can, you do, and you're done!

Figure 1.12 The first and last points on a journey or project

But if you can't, then you need to add intermediary points between the two. Figure 1.13 shows the new points. At this stage you focus on two adjacent points and ask the above question.

Figure 1.13 Intermediary points between the first and last points

It's quite possible that the new points are still too far apart. So, you repeat the process and add more intermediary points. Eventually, you should reach the situation shown in Figure 1.14. It shows a collection of paths connecting the intermediary points between two points. Slowly, you work your way back up the chain connecting points and adding new ones until you have an unbroken path that leads from your first point to your last one.

Figure 1.14 Paths connecting intermediary points between two points

Now, you may ask yourself how all these points and paths are going to help you write a chunk of code. That's where to-do lists come in. When I'm working on a project and need to decide what to do next, I look at my to-do lists. Each list sits on its own piece of paper and contains the words To Do, the date when I started the list, and the tasks I have to do in order to complete my project or part of my project. Each task on one of my lists corresponds to one of the points I described above. When I get ready to write the code for a particular task, I am faced with two situations:

• the task is easy (i.e., it is one of my baby problems) and I just type in the code without thinking, or
• the task is hard (i.e., an everything-else problem, but one that is easier than the actual project I am working on), and I start a new to-do list that breaks the task into smaller parts.

If you look at the previous subsection, you will see examples from my to-do lists. First, the decision to begin with a simpler problem was on my project's list. Then, the list of things the first program needed to do was a first pass at my to-do list for it.

There are times when a task refuses to break cleanly into simpler tasks. So, I sit and think and write the code on paper. I refine it (possibly rewriting it) until I'm comfortable with it. Then, I just type it in without thinking.

At this point you should begin to see that I never really do my original project. Instead I'm continually working on pieces of it. One of the things you need to learn is how to put pieces together.

Programs Grow-Debug as You Go

I start each new project with the program shell I described above. It doesn't have any bugs since it doesn't do anything except compile. Each time I add a task on a to-do list to my program, I recompile and test that the task does what it needs to do. I try to break the new code using every trick I know. I've learned that I will never understand the chunk of code I've just typed in better than I do at this moment. Also, I know it will take me less time to fix problems that may be in the code now, when I still have each of its lines at my fingertips, than it will later after I've written several thousand other lines of code.

If we go back to Figure 1.12 and pretend the points A and B are on the opposite banks of a river that we want to build a bridge across and the intermediary points are places where we'll construct pilings that will hold supports for the bridge, then the tasks on my to-do lists correspond to those pilings. I want to drive each one down to bedrock so that I know it will support the weight of the program I'm going to ask it to carry.

For example, suppose you need a routine that takes a pointer to a buffer, the buffer's length, and a special character and returns two pointers and two lengths. One pair gives you the string immediately after the first special character in the buffer, and the second gives you everything up to, but not including, the first special character. This routine is not difficult to write, but it has a number of special cases and one nasty case. If you overlook any of them, you will probably let a bug creep into your code.

Here are the cases:

• There is the normal case where the special character appears somewhere in the middle of the buffer. This makes the two pairs we want, but getting the correct lengths is tricky. When I'm ready to calculate them, I'll draw a picture on paper and count.
• There are two special cases where the character we want to match is either the first or last character in the buffer. This makes one of the lengths zero. What is the corresponding pointer?
• There is the special case where the character doesn't appear in the buffer.
• There is the special case where the buffer's length is zero. That means the character can't appear, but if you don't write your code correctly, you can easily let in a bug.
• Finally, there is the nasty case where the original pointer is NULL. If you don't trap it at the start of your code, you'll get a segmentation fault when your program encounters such a pointer. This bug should be easy to fix if you catch it while you're working on this routine. But if the problem shows up a week or month later, you can waste an hour or more trying to track it down.
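One possible shape for this routine, covering every case above, is sketched below. This is my own hedged version, not code from the book: the name splitbuf, the parameter order, and the convention for the not-found and empty cases are assumptions.

```c
#include <stddef.h>

/* Split buf[0..nbuf) at the first occurrence of c.
   *pbefore/*nbefore get everything up to, but not including, c;
   *pafter/*nafter get everything immediately after it.
   Returns 1 if c was found, 0 otherwise (a chosen convention). */
static int splitbuf(char *pbuf, int nbuf, char c,
                    char **pbefore, int *nbefore,
                    char **pafter, int *nafter)
{
    int i;

    if (pbuf == NULL || nbuf <= 0) {     /* nasty and zero-length cases */
        *pbefore = *pafter = NULL;
        *nbefore = *nafter = 0;
        return 0;
    }
    for (i = 0; i < nbuf; i++)
        if (pbuf[i] == c)
            break;
    if (i == nbuf) {                     /* c never appears in the buffer */
        *pbefore = pbuf;
        *nbefore = nbuf;
        *pafter  = NULL;
        *nafter  = 0;
        return 0;
    }
    *pbefore = pbuf;                     /* length 0 when c is first */
    *nbefore = i;
    *pafter  = pbuf + i + 1;             /* one past the end, length 0,
                                            when c is last */
    *nafter  = nbuf - i - 1;
    return 1;
}
```

Note how the NULL and zero-length cases are trapped first, before any pointer is dereferenced.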

In order to avoid such problems, I program defensively: I never trust the data I'm processing, and I test, test, test. One advantage to my method is that when I finish writing code for a project, I'm done with the project. I've already tested it, and I'm confident that it has few bugs.

Summary

This is a section that I hope you will look back at as many times as you need to until the techniques I've described here become part of your style. But now it's time for us to start looking at some code.

Section 1.5 A Program's Shell      [cur chap]   [prev sect]   [next sect]   [next chap]

I described in the previous section how I begin a new program by taking a working program, which does something similar to what I want the new one to do, and hacking it until only a shell remains. Then, I take the working program's Makefile and tweak it until my new shell compiles. This activity takes place in a new directory and uses copies of the working files.

This section describes the shell, Makefile, and several miscellaneous files that we will need in the next chapter when we begin work on the productions program. Actually, that program will go through a number of versions. Each one is called mkxmlpN.c, where N is the version number, starting at 0; the rest of the name stands for "make xml productions."

I call everything that comes before a C file's first function the file's heading. Figure 1.15 shows the heading for our shell, mkxmlp0.c.

-------------------------------------------------------------------------- mkxmlp0.c: heading
2  /* mkxmlp0.c   Process & format XML production rules
3                 This version breaks document into text and whitespace. */
4  char tusage[] =
5  "\n"
6  "usage:\n"
7  "\n"
8  "mkxmlp0 fname   num\n"
9  "\n"
10  "       fname name of input file to work with.\n"
11  "\n"
12  "       num = 0 check 1 file\n"
13  "\n"
14  "       exit status:\n"
15  "            0 - ok.\n"
16  "            1 - usage error.\n"
17  "            2 - other errors.\n"
18  "\n";
19
20  #include <stdio.h>
21  #include <stdlib.h>
22  #include <syslog.h>
23  #include <unistd.h>
24
25  #include "sp0.h"
26
27  #include "OBJ/mkxmlp0.ed0"
28  #include "OBJ/mkxmlp0.sd"
29  #include "sperrlog.ed0"
30  #include "spranfio.ed0"
-------------------------------------------------------------------------- mkxmlp0.c: heading
Figure 1.15 Items from the beginning or "heading" of mkxmlp0.c

If you glance quickly at the lines in the heading, you could easily conclude that not much is going on there. In a way you would be right. Remember, the program does nothing! But, if you carefully read each line, I can almost guarantee that you will find at least one question. Every program in this book began life looking like this one. It's critically important that you understand every part of the shell and its Makefile. So, I need to explain how things work, and this is the place where I will do that.

I begin every C file with a comment that gives the file's name. It includes a brief description that explains what the file does. Later, when I'm typing the lines that do a task on one of my to-do lists, I enter comments whenever I do something tricky or when I think the comment will help me figure out, six months down the road, what I did. It is easier, faster, and more accurate for me to enter comments as I enter material in a source file than it would be for me to go back and comment a file after I complete it. My comments are not always good English and they may contain misspellings, but I try to make them accurate!

4-19  tusage

These lines define the global variable tusage. The lines serve two purposes:

• they describe, in far more detail than my opening comment, exactly what the program does, and
• they prepare a message that the program writes to the screen if I run the program with the wrong arguments.

Notice that the lines are a single statement; only the last one ends with a semi-colon. String concatenation was added to C in 1989, when the American National Standards Institute released the ANSI standard for C [K&R 194]. Figure 1.16 shows the usage message mkxmlp5 makes.

 ---------------------------------------------------------------
 ~/doc/xml/.prules -->$ mkxmlp5

 usage:

 mkxmlp5 fname   num

        fname name of input file to work with.

        num = 0 write check file
        num = 1 write normalized file.
        num = 2 write normalized file after it is sorted by lhs.
        num = 3 write productions (in ASCII).
        num = 4 write productions sorted by lhs (in ASCII).
        num = 5 write productions (cross referenced & in ASCII).
        num = 6 write productions (cross referenced & in TeX).
        num = 7 write productions (cross referenced & in HTML).

        output file names:
             fname.err  all num
             fname.chk  num = 0
             fname.N    num = N > 0

        exit status:
             0 - ok.
             1 - usage error.
             2 - other errors.
 ~/doc/xml/.prules -->$
 ---------------------------------------------------------------
 Figure 1.16 The usage message for mkxmlp5
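The concatenation rule is easy to see in a tiny example. Adjacent string literals merge at compile time, so the whole message is one statement and only the last line carries the semi-colon. (tdemo is a stand-in name for this illustration, not a variable from the book.)

```c
/* Adjacent string literals are concatenated by the compiler,
   so these three literals form one string in one statement. */
char tdemo[] =
"usage:\n"
"\n"
"demo fname num\n";
```

The three literals contribute 7, 1, and 15 characters, so tdemo holds a single 23-character string (24 bytes with the terminating '\0').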

20-24  System Header Files

Not much is going on here. I'll add a new system file when I discover I need it.

25-26  Program Header Files

This particular file dates back to a project I completed in the early 90s. Figure 1.17 shows sp0.h's lines. When we get to the front end, I'll include these lines in a header file. The productions program doesn't require a header file, and when I made this shell, I didn't want to get distracted and create one.

 -----------------------------------------------
 /* sp0.h   Basic constants for stock project */

 #ifndef _SP0_H
 #define _SP0_H

 #define YES   1
 #define NO    0

 #define ON    1
 #define OFF   0

 #define PASS  1
 #define FAIL  0

 #endif
 -----------------------------------------------
 Figure 1.17 Basic constants in sp0.h

27-30  Extern and Static Declarations

Each of my source files (e.g., mkxmlp0.c) requires its own static declaration file (e.g., mkxmlp0.sd). It also requires its own extern declaration file (e.g., mkxmlp0.ed0). If it uses a function defined in another source file, it requires that file's .ed0 file. Figures 1.18 and 1.19 show these files for spranfio.c.

In the early 90s I wrote a program that processed real-time stock market data. My PC received the data via satellite, and I stored the data in a home-grown database. Also, I kept all the functions I used for file I/O in a single source file: spranfio.c-stock project: random file I/O. That file and other files in the stock project used functions in sperrlog.c to write error messages to a log file. Both files are completely debugged, and I use them anytime I need to do I/O in a program. We won't look at their functions in the productions program, but we will when we get to the front end in Part 2.

 -------------------------------------------
 /* spranfio.c */
 extern int  rfopen(char *,int ,int );
 extern int  rfopen1(char *,int ,int ,int );
 extern int  rfread(int ,char *,int ,int );
 extern int  rfwrite(int ,char *,int );
 extern void rfldblk(int ,char *,int ,int );
 extern void rfsvblk(int ,char *,int ,int );
 extern int  rftell(int );
 extern void rfseek(int ,int ,int );
 extern int  rfsize(int );
 extern void rfclose(int );
 extern int  rfclall();
 -------------------------------------------
 Figure 1.18 The extern declaration file for spranfio.c

 -------------------------
 /* spranfio.c */
 static void ftinit(void);
 -------------------------
 Figure 1.19 The static declaration file for spranfio.c

I'll explain why some of the files in this area start with OBJ/ when I discuss the Makefile for mkxmlp0.c.

Finally, if you're curious about sp0.h's name, it also originated in the stock project.

Summary for the Heading

See, there was more going on in Figure 1.15 than a quick glance at its lines suggested.

The Main Routine

But that is not the case for the rest of mkxmlp0.c. Figure 1.20 shows its main routine.

----------------------------------------------------------------------------- mkxmlp0.c: main
40  int main(int argc,char *argv[])
41  {
42       int terr,tnum;
43
44       errlog_init("mkxmlp0",LOG_PID,LOG_LOCAL0);
45       if (argc != 3)
46            usage();
47       tnum = 0;
48       terr = 0;
49       endprog(terr);
50       return(0);     /* keeps compiler happy */
51  }
----------------------------------------------------------------------------- mkxmlp0.c: main
Figure 1.20 The main() routine for mkxmlp0.c

I have only two comments to make about these lines:

1. errlog_init initializes the error-message capability in sperrlog.c.
2. If the program's arguments are not correct, main calls usage (see Figure 1.21). The test shown here becomes more complete in mkxmlp1.c.

Utility Functions

Figure 1.21 shows three utility functions that I use in almost every program I write.

---------------------------------------------------------------------------- mkxmlp0.c: usage
54  void endprog(int terr)
55  {
56       exit(terr);
57  }
58
59  void usage(void)
60  {
61       printf("%s",tusage);
62       endprog(1);
63  }
64
65  void error(char *pt1)
66  {
67       printf("ERROR: %s\n",pt1);
68       endprog(2);
69  }
---------------------------------------------------------------------------- mkxmlp0.c: usage
Figure 1.21 Three utility functions in mkxmlp0.c

54-58  endprog

I try not to sprinkle calls to exit throughout my code. Some programs require you to perform a number of housekeeping chores before you end the program. That is easy to do if you have a single function that performs those chores and then calls exit with an appropriate exit status.

59-64  usage

This function displays the usage message held by the global variable tusage defined in the heading. Although tusage is different for each new version of the productions program, the usage function stays the same. Figure 1.16 shows the usage message for mkxmlp5.

65-69  error

When the productions program runs into a problem that is fatal, it calls this function with an appropriate message.

Exit Status

Notice how the parameter included with the call to endprog in main, usage, and error agrees with the exit status specified in tusage. I often bundle a program with related programs in a shell script. That script checks each program's exit status and bails out of the script if the program encountered a problem. I find it easier to make all my programs end with an accurate status than to fix things after the fact if I decide to wrap a program in a script.
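To see why an accurate exit status matters, here is a hedged POSIX sketch of how a wrapper can check a program's status the way my shell scripts do. run_checked and the command strings are illustrations of the technique, not code from the book.

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Run a command and recover its exit status (POSIX).
   Returns the child's status, or -1 if it could not run
   or was killed by a signal. */
static int run_checked(const char *cmd)
{
    int rc = system(cmd);

    if (rc == -1)               /* could not start the command at all */
        return -1;
    if (WIFEXITED(rc))          /* normal exit: unpack the status     */
        return WEXITSTATUS(rc);
    return -1;                  /* terminated by a signal, etc.       */
}
```

A driver can bail out as soon as run_checked returns nonzero, which is exactly the behavior the shell script gets from checking each program's exit status.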

The Makefile

Figure 1.22 shows an early version of the Makefile I use to compile the programs in the productions project.

--------------------------------------------------------------------------------------
1  # ~/doc/xml/.prules --- Makefile for productions project
2  #
3  # To get math library, compile with: -lm.
4
5  HPATH   = /home/david/c/sp/sp0
6  SP0     = /home/david/c/sp/sp0/OBJ
7
8  vpath %.h    $(HPATH)
9  vpath %.ed0  $(SP0)
10
11  CC      = gcc -I$(HPATH) -I$(SP0)
12  DEBUG   = -g
13  COMPILE = $(CC) $(DEBUG)
14
15  SRCF  = mkxmlp0.c \
16          mkxmlp1.c
17
18  PROGF = $(basename $(SRCF))
19
20  OBJDIR  = OBJ/
21
22  OBJF  = $(addprefix $(OBJDIR),$(SRCF:.c=.o))
23
24  DEPF  = $(addprefix $(OBJDIR),$(SRCF:.c=.d))
25
26  .PHONY : all
27
28  all : $(PROGF)
29
30  mkxmlp0 : $(OBJDIR)mkxmlp0.o
31          $(COMPILE) -o $@ $^ $(SP0)/spranfio.o $(SP0)/sperrlog.o -lm
32
33  mkxmlp1 : $(OBJDIR)mkxmlp1.o
34          $(COMPILE) -o $@ $^ $(SP0)/spranfio.o $(SP0)/sperrlog.o -lm
35
36
37  $(OBJDIR)%.d   : %.c
38          $(SHELL) -c 'mkpdd 2 $< $(@D)'
39
40  $(OBJDIR)%.ed  : %.c
41          $(SHELL) -c 'mkpdd 1 $< $(@D)'
42
43  $(OBJDIR)%.ed0 : $(OBJDIR)%.ed
44          $(SHELL) -c 'cmp -s $@ $< || cp $< $@'
45
46  $(OBJDIR)%.o : %.c
47          ${COMPILE} -o $@ -c $<
48
49  include $(DEPF)
--------------------------------------------------------------------------------------
Figure 1.22 The Makefile for the productions project

Before I explain how the file works, I should explain the tasks I need it to perform:

1. Each of the programs uses files (e.g., sp0.h, spranfio.ed0, and sperrlog.ed0) that live in directories far away from the directory where the productions project lives. So, I need to tell make and the compiler how to find them.

2. Every time I compile a .c file, I get .o, .sd, .ed0, and .d files. I view those files as clutter, and I don't want them to be visible in my working directory. So, the first thing I do, after I create a new directory for a project, is create a subdirectory in it named OBJ. That is where I want to store my clutter files.

3. My compiler, gcc, turns .c files into .o files. But it does not create declaration and dependency files-at least I can't figure out how to make them with it. So, when I migrated to Linux, I wrote a small program, mkpdd (make program declarations and dependencies), that did. Figure 1.23 shows mkpdd's usage message.

 ------------------------------------------------------------------
 ~/doc/xml/.axp -->$ mkpdd

 usage:

 mkpdd num file [dir]

        num:  1 make extern & static function declarations,
                this makes: file.ed (extern defs)
                            file.sd (static defs)
              2 make dependencies for file,
                this makes: file.d (dependencies)

        file  name of .c to work with (ending .c is optional),

        [dir] subdirectory to put the new files,

        eg: mkpdd 1 mkpdd OBJ
 ~/doc/xml/.axp -->$
 ------------------------------------------------------------------
 Figure 1.23 The usage message for mkpdd

Now that you know what my Makefile should do, I can explain how it works.

1-3  Comments

I once spent a morning trying to compile a program. I kept getting cryptic error messages that I've long since forgotten.
Finally, I realized my program was using math functions, and that even though I'd included the necessary math headers in my source file, the compiler needed to be told to load the math library. I was so disgusted with myself that I put a reminder in that project's Makefile. That explains line 3!

5-13  Task 1-Miscellaneous Files

My Makefile uses these lines to perform task 1. In particular,

• lines 5 and 6 hold the directory names where sp0.h, spranfio.ed0, and sperrlog.ed0 live;
• lines 8 and 9 tell make to add those directories to the paths it searches for .h and .ed0 files; and
• lines 11-13 set up the variable COMPILE, which is used on lines 31 and 34.

15-34  Task 2-Compiling

My Makefile uses these lines to put each .o file in its proper place. In particular,

• Line 15 starts a list of the programs in the productions project.
• Lines 20-24 prepare three related lists:
1. PROGF is mkxmlp0 mkxmlp1,
2. OBJF is OBJ/mkxmlp0.o OBJ/mkxmlp1.o, and
3. DEPF is OBJ/mkxmlp0.d OBJ/mkxmlp1.d.
• Lines 26-28 tell make to treat each name in PROGF's list as a target and to look for a rule to make that target. Lines 30-34 hold those rules.

When I get ready to start version 2, I can add a "\" to line 16, insert a line after line 16 that holds mkxmlp2.c, put a copy of lines 33-34 after line 35, and change each "1" on the copy of line 33 to "2".

Lines 31 and 34 hold two automatic variables:

1. $@ is the file name for the target of the rule. On line 31 it is mkxmlp0.
2. $^ holds the names of all the dependencies. On line 31 it is OBJ/mkxmlp0.o. Later in the book, we'll encounter a situation where $^ is more complicated.

37-49  Task 3-Definitions & Dependencies

These lines create the .d, .sd, and .ed0 files for each item in SRCF. Here is how the lines work:

• Each time I run make, line 49 tells it to include the items in DEPF. One of those items is OBJ/mkxmlp0.d. The first time I run make, that file won't exist, so make asks itself if it has a rule to make such a file.
• Line 37 tells make that OBJ/mkxmlp0.d depends on mkxmlp0.c, and line 38 tells make to feed the command "mkpdd 2 mkxmlp0.c OBJ" to the shell. Note, $< and $(@D) are also automatic variables. They give the names of the first dependency and the target's directory (without a trailing "/"). Figure 1.23 shows mkpdd's usage message; mkpdd expects three arguments:
1. the first must be 1 or 2,
2. the second should be a C file, which need not end in .c, and
3. the third is an optional subdirectory.

So, the mkpdd command creates mkxmlp0.d and puts it in the OBJ subdirectory of the current directory. Figure 1.24 shows that OBJ/mkxmlp0.d contains two basic lines:

1. The first line contains additional dependencies for files that end in .o or .d. The line doesn't have an associated rule because make already knows how to make files that live in the OBJ subdirectory and end in .o or .d.
2. The second line is unusual. .SECONDARY is similar to .PHONY. It is a built-in instruction that tells make to look for OBJ/mkxmlp0.ed. Initially, that file won't exist, and lines 40-41 tell make what to do. Once again make feeds a special command involving mkpdd to the shell. That command creates OBJ/mkxmlp0.ed and OBJ/mkxmlp0.sd.
 --------------------------------------------------------------
 OBJ/mkxmlp0.o OBJ/mkxmlp0.d : mkxmlp0.c sp0.h \
         OBJ/mkxmlp0.ed0 OBJ/mkxmlp0.sd \
         sperrlog.ed0 spranfio.ed0

 .SECONDARY : OBJ/mkxmlp0.ed
 --------------------------------------------------------------
 Figure 1.24 An edited version of the dependencies file, OBJ/mkxmlp0.d

For several pages I've said .ed0 files hold extern declarations. Suddenly, we've got an .ed file, and it also holds extern declarations. Why? What is going on? The answer is subtle.

If you look at lines 43-44 of our Makefile, you'll see each .ed0 file depends on its .ed file. The rule on line 44 is also a shell command, but it has two parts connected by ||, the shell's logical or, which works the same way C's does. Namely, if the first part is true, the shell stops and does not perform the second part. For mkxmlp0.c the parts are:

1. cmp -s OBJ/mkxmlp0.ed0 OBJ/mkxmlp0.ed
2. cp OBJ/mkxmlp0.ed OBJ/mkxmlp0.ed0.

Now, if the files are the same (their contents, not their time-stamps), the comparison is true. Note, the -s option tells cmp to work silently.
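The short-circuit behavior can be demonstrated in C itself. These stand-in functions (my own, not from the book) mimic the roles of cmp -s and cp: copy() runs only when same() reports a mismatch.

```c
/* Demonstrate C's short-circuit || -- the same rule the shell
   applies to `cmp -s new old || cp old new'. */
static int ncopies = 0;

static int same(int match) { return match; }          /* plays cmp -s */
static int copy(void)      { ncopies++; return 1; }   /* plays cp     */

/* Run the "compare, else copy" step once. */
static void update(int match)
{
    (void)(same(match) || copy());   /* right side skipped when left is true */
}
```

When the files match, the left side of || is true and copy() never runs, so the .ed0 file's time-stamp is left alone.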

This extra step means that, every time I change mkxmlp0.c and run make, I get a new OBJ/mkxmlp0.ed. Most of the time the only thing new about the file will be its time-stamp. Its contents won't change unless I changed the declaration of an existing function or added a new extern function. Those are precisely the situations when I need a new OBJ/mkxmlp0.ed0 file, and they are the only times make makes one.

This savings is not readily apparent in the productions project because every program lives in a single source file (except for spranfio.o and sperrlog.o, and they don't change). But the front end lives in a number of source files that work with each other in a complicated way. There, I can make a change to one source, say A.c. When I run make, it will recompile A.c, but it won't recompile another file, say B.c-even if B.c includes OBJ/A.ed0-unless A's .ed0 changes.

As I said, something subtle is going on here. We'll examine this issue again when we look at the front end's Makefile.

1.1   Get an XML form of the W3C document I referred to in Example Two of section 1.2 and write a script that pulls the productions out of it.

1.2   Look back at Figure 1.3. Can you think of reasons why I made paragraph, <p>, a tag instead of treating it the way I did italics, [it]? Hint: Are there other functions, besides making chunks of stuff, that a tag might need to perform?

1.3   Pick a dozen or so words from a dictionary. Use them to make an HTML file that is similar to the file I described in Example Three of section 1.3. Your file should have a single summary area with one row for each word (broken into syllables) and its part of speech. Your detail area should have the word and its definition. The whole point of this example is getting the links right. If necessary, add junk to your detail areas, so only one or two such areas fit on your screen at one time.

1.4   Draw the picture and compute the lengths for the normal case of the "two pointers problem" I described in the Programs Grow subsection of Section 1.4.

1.5   Write code for the problem in the last exercise (all the cases). Write a small program that tests your code.

1.6   Look at the usage messages for mkxmlp5 and mkpdd (see Figures 1.16 and 1.23). Each of those programs requires a number, num. It turns out that the productions programs are easier for a person to run from an xterm if num is the last argument. Why?

Chapter 2

A Quick and Dirty Parser: Version 1

[summary]   [detail]   [prev chap]   [next chap]

 Section 2.1 Introduction 2.2 Version 1's Goals and Tasks 2.3 Data Structures-The Heading 2.4 Initialization and File Input 2.5 Parsing 2.6 The Output Phase-Writing Reports 2.7 Summary

Section 2.1 Introduction      [prev sect]   [next sect]

Our goal, for the rest of Part 1, is to build a program that turns EBNF productions, similar to the ones in Figure 1.4, into the forms shown in Figures 1.7 and 1.8. I'm calling the program Quick because I built it quickly. It is Dirty because the program's code lacks elegance. For example, a program of its size and complexity should be broken into several source files and have at least one header. But it was easier for me to put everything in one file, so I did. Also, the program wastes memory, again because it was easier for me to get things working that way. Actually, only the program's source code is Q&D. The program itself is not buggy, has good error messages, and prepares elegant output. A user who only runs the program would never know it was Q&D.

Now, I didn't just sit down and write the productions program in one step. Instead I wrote it in versions. There are several advantages in growing a program the way I grew this one:

• You can test things as you go. This means, when you finish writing code, you're done with the program. It should have fewer bugs than it would if you wrote and entered the code and then tested it.
• You can work more quickly since you are spending most of your time working on simpler problems that eventually become the problem.

It's important to learn how to write code this way. So, I've decided we'll study each version just as I wrote it. That way you can see exactly what I mean when I speak of growing a program. But, before we begin version one, I want to say a few words about two topics that should be familiar.

On Space and Times

The adage-you can make a program use less memory or run faster, but you can't do both-is old and may still be true. But, it is much less important than it once was. In the early and mid eighties I frequently made a cup of coffee in the five minutes it took a program-about as complicated as the final version of the productions program-to compile. Also, I often wrestled with a program's data as I tried to fit it into a 64K segment. Today, the productions program compiles in seconds on a PC I purchased in 1997 that has 128MB of memory. Since the PC was introduced, storage space (disk and memory) and processing time have improved enormously. But a programmer's time-the time it takes to write a program-has barely changed and is still much too long.

When I write code, I frequently reach a point in a program where I can't continue until I choose between two or more algorithms, storage techniques, or data structures. Traditionally, programmers used space-time tradeoffs to help them make this type of decision. But, I increasingly find myself adding my own time to the traditional two when I make decisions. So, if I say I've done something because it was easier, I really mean that it was quicker for me to do it that way, and my decision didn't seriously impact the PC's space or time. Of course, there are times when the quick way is a poor choice and shouldn't be used. Making good decisions is an important component of a programmer's style. We'll see examples of it in future sections.

Section 2.2 Version 1's Goals and Tasks      [cur chap]   [prev sect]   [next sect]   [next chap]

Version one's goals are modest. The program should accept any ASCII file as input and make a copy of the file. It will be easy to test, because we can compare the new file (via cmp) with the original one. If the two are the same, the program works. Otherwise, it has a problem. There are lots of ways to write a copy program, and most of them won't help us at all on the productions program. So, I decided version one needed to do one more thing: parse its input file into white space and text.

Now that we understand what our first program is supposed to do, it's tempting to begin looking at its code. But we shouldn't do that until you understand how I started to work on version one.

I first turned the above goals into tasks. I then realized there was no single correct way to perform any of them. Instead, there was a short question associated with each one. The answers to the questions became options, and I quickly realized that I couldn't do anything until I chose one option for each question. In other words, I needed to make a number of decisions.

Decisions

The tasks and their questions are easy to state. We need to:

1. load a file-how?
2. parse it-into what?
3. write a new file-when?

Here are the options for each of the above questions:

1. There are four ways to load a file:
1. a byte at a time using getchar or an equivalent,
2. a line at a time using getline or an equivalent,
3. a block at a time using rfldblk or an equivalent, or
4. in one step-get the file's size, allocate a buffer, and read the file into memory.

Similarly, there are two times when we can write the file:

1. as we go (i.e., write something each time the parsing stage gets a new token), or
2. at the end (i.e., save the tokens until we parse the whole file and then write them).

Finally, it should be clear that however we parse, we will need a pointer to each chunk of text. But, we can work with that pointer in two ways. We can:

1. move the text to a new location and put a '\0' at its end, or
2. either move the text or leave it in place but use its length to tell us where it ends.

All these choices suggest that there are many ways to write code that does what version one is supposed to do. But, that code will only be useful if I can enhance it until it solves the original productions problem. So, what should I/we do? It looks like we're about to get bogged down in our easier problem.

The first thing to do is realize that beginnings are often difficult. The second is to get the above goals, tasks, questions, and answers on paper. Then, you need to think and ask yourself, "Does our problem dictate the answers to one or more of the above questions?" By problem, I mean the productions problem, not version one.

This question turns out to be a critical one. I suggest you try and answer it, now.

If you need a hint, refer back to Figure 1.8 and look closely at exactly what we're trying to do.

If you need a second hint, ask yourself when you would know the rule number for each item on the rhs of the first production in Figure 1.8.

See?

"When" turns out to be a question whose answer is forced. It became the lever that I used to crack our problem. It's simply not possible in the productions problem to write the new file as it is being parsed. So, it would be a mistake to write version one using that method. Once I decided to write the new file after I'd finished loading and parsing the input file, load's last option looked interesting for a number of reasons:

• We will have to build a complete representation of the input file in memory. Storing the whole file is the easiest way to do that.
• Buffered input has overhead and other complications which we will avoid if we load the file in one step.
• We can make each text token store a pointer to the start of the text and the length of the text. That means there is no need to move chunks of text.

Summary

Each of the questions associated with the program tasks turned out to have a natural option. That's a positive sign that we're on the right track. However, we need to make a final set of decisions before we can start writing code.

Section 2.3 Data Structures-The Heading      [cur chap]   [prev sect]   [next sect]   [next chap]

Version one's heading holds global variables and data structures. Its global variables are straightforward; we'll look at them when we look at the actual heading. Its data structures need to do the following:

• store info for an individual chunk of text. We've decided to use a pointer and a length to do this.
• store info for an individual piece of white space. There are several ways to do this. One is to use the text structure with a length of one. Let's postpone this decision for a moment.
• link the individual items together so we can process the parsed items as a group. There are two different ways to do this:
1. allocate a block of memory at the beginning of the program and put each item in the block when it is encountered, or
2. allocate memory for the individual items as they are encountered and link them together.

Clearly the second option is more elegant than the first one, but it requires more code. I decided to use it in Part 2 and to use the one-time block here. I then realized it would be a mistake for white space to use the text structure for the following reasons:

• I want to thoroughly debug the basic memory representation routines in this program.
• I want to be able to add new structures easily to the basic routines.
• I know the structures for tags will be different from the one for text.

So, I decided the white space structure should hold a code identifying the actual character and a character count.

After I made those decisions I began to draw possible blocks on paper. Figure 2.1 shows my final layout. I call it the main token block (MTB). It has three parts:

1. a fixed size header. This holds the small number of global variables I'll need to work with the block.
2. variable size data elements. These correspond to the chunks I parse the file into. In version one they are white space and text. They grow up the block.
3. fixed size tokens. In the representation I chose there is not a way to get directly into the data elements. Instead, each one has an associated token which holds a code identifying the type of element and a pointer to the element. The tokens begin at the end of the block and grow down into it. This extra level of indirection is a partial consequence of my decision to put things in a single big block.

Figure 2.1 Header, data elements, and tokens in the main token block

Remember our shell, mkxmlp0.c, from the last chapter? I began work on version one by copying it to mkxmlp1.c. In the new file I changed each instance of mkxmlp0 to mkxmlp1; adjusted the usage message; and transferred the data structures I had on paper to the new header. Figure 2.2 shows the first part of the finished product. It is basically unchanged from the shell.

------------------------------------------------------------------------- mkxmlp1.c: heading1
2  /* mkxmlp1.c   Process & format XML production rules
3                 This version breaks document into text and whitespace. */
4
5  char tusage[] =
6  "\n"
7  "usage:\n"
8  "\n"
9  "mkxmlp1 fname   num\n"
10  "\n"
11  "       fname name of input file to work with.\n"
12  "\n"
13  "       num = 0 check 1 file\n"
14  "\n"
15  "     output file names:\n"
16  "            fname.chk      num = 0\n"
17  "\n"
18  "       exit status:\n"
19  "            0 - ok.\n"
20  "            1 - usage error.\n"
21  "            2 - other errors.\n"
22  "\n";
23
24  #include <stdio.h>
25  #include <stdlib.h>
26  #include <syslog.h>
27  #include <unistd.h>
28  #include <stdarg.h>
29
30  #include "sp0.h"
------------------------------------------------------------------------- mkxmlp1.c: heading1
Figure 2.2 Part 1 of mkxmlp1.c's heading: usage message and include files

Globals

Figure 2.3 shows mkxmlp1's global constants and variables. I put the variables in a structure so that, later in a function, I can tell at a glance if a variable is in fact a global. If it is, and I've forgotten what it does, I know exactly where to look for it. I build this structure as I write a program's code. If I discover something several functions need, and I decide it's easier not to pass the item around as a parameter, I put it here.

------------------------------------------------------------------------- mkxmlp1.c: heading2
34  /* globals used for several nums */
35
36  #define MAXFNL  60  /* max length of file name */
37  #define MAXELL 150     /* maximum fatal error line length */
38  #define MLTOKB 100000  /* max length of main token block */
39  #define WBLEN  1024    /* max length of write buffer */
40
41  struct tfiles       /* holds GLOBAL variables */
42  {
43       char fname[MAXFNL+1];    /* file name entered on execution line */
44       char fnchk[MAXFNL+5];    /* 'check' file name */
45       char wbuf[WBLEN+1];      /* write buffer used with mfillbuf() */
46       char emb[MAXELL+1];      /* message buffer for fatal errors */
47       char *pbuf;              /* pointer to start of input file */
48       int tlen;                /* length of input file */
49       struct tblock *ptb;      /* pointer to main token block: see below */
50  };
51  struct tfiles tf;
------------------------------------------------------------------------- mkxmlp1.c: heading2
Figure 2.3 Part 2 of mkxmlp1.c's heading: global constants and variables

Main Token Block

A program's data structures are one of its most important parts. It's difficult to comment them too heavily. You could read my code and comments and figure out what's going on here, but instead I want to look briefly at each structure. You must obtain a global understanding of the data the structures hold before you can write or understand the code that uses them.

55-67  MTB Comment

This comment is my reminder of how things should work. I can read it and remember what I've done much more easily than I can reconstruct things from the code. These lines are another explanation of Figure 2.1. Notice the typo on the second line of item 2.

68-76  MTB Header: tblock

I find that structures like this one have two kinds of items: required and convenience. The required items are generally easy to decide on, and they are the ones I initially put in the structure. Then, later, when I write the code that manipulates the structures, I add a convenience item if it's easier to store the value than it is to recompute it. At least, that's the theory. Sometimes, I write the whole thing at once. If you look closely here, there is only one required item: numt. Also, there are two items that never get used. As you read the code, try to identify the extra items.

77-82  Tokens: ttoken

Each data element requires its own token, which holds the element's id number and a generic pointer to the element. When I prepare a report, I will step through the tokens and use each token's id to recast its pointer to the appropriate data element.

------------------------------------------------------------------------- mkxmlp1.c: heading3
55  /* main token block -- This has three sections:
56       1. A fixed length segment for tblock.
57                 This begins at the start of the block.
58       2. Variable length items for types of tokens: e.g., ttext, twspace,
59          tlement, tattribute, ... .
60                 This starts immediately after segment 1 and grows up the block.
61       3. Fixed length items for ttoken.
62                 This starts at the end of the block and grows down the block.
63
64    | tblock | elements ... ->                                   <-...tokens |
65    |________________________________________________________________________|
66    | <- ...              main token block                            ... -> | */
67
68  struct tblock       /* header at start of main token block */
69  {
70       int free;                /* amount of free space in block */
71       int numt;                /* number of tokens in block */
72       struct ttoken *ptok0;    /* pointer to first token in block */
73       void *pd0;               /* pointer to first data element */
74       void *pd;                /* pointer to next free data element */
75  };
76
77  struct ttoken       /* tokens at end of token block */
78  {
79       int id;        /* unique id (> 0) for token: see tokids[]. */
80       void *pde;     /* pointer to data element for this token */
81  };
82
83  struct ttext        /* text data element */
84  {
85       char *pt;      /* pointer to start of text (no white space) */
86       int   lt;      /* length of text */
87  };
88
89  struct twspace      /* whitespace data element: holds space, \n, \t, \r */
90  {
91       unsigned int tid : 4;    /* id for type of space: see whsp[] & ttkid[] */
92       unsigned int nc  :28;    /* # of chars of this type */
93  };
94            /* 012 3 4 */
95  char whsp[]="  \n\t\r";   /* types of whitespace. see tid in struct twhspace */
96
97  struct tokids
98  {
99       int tid;  /* unique id # for this data element */
100       int sr;   /* space this data element requires */
101  };
102  struct tokids ttkid[] =
103  {
104      {0,0},                         /* starts the list. DO NOT USE */
105      {1,sizeof(struct twspace)},    /* twspace - ' '  space */
106      {2,sizeof(struct twspace)},    /* twspace - '\n' new line */
107      {3,sizeof(struct twspace)},    /* twspace - '\t' tab */
108      {4,sizeof(struct twspace)},    /* twspace - '\r' carriage return */
109      {5,sizeof(struct ttext)},      /* ttext */
110
111      {0,0}   /* ends the list. DO NOT USE */
112  };
------------------------------------------------------------------------- mkxmlp1.c: heading3
Figure 2.4 Part 3 of mkxmlp1.c's heading: data structures for elements and tokens

83-88  Text Data Element: ttext

This holds the two items we decided on above. Both are required.

89-96  White Space Data Element: twspace

There are many ways to work with white space. I decided to use a code and a character count. The code is 1-4 and corresponds to the position of the character in whsp[] (note the first two characters in that array are spaces).

97-112  Token id #s & Space for Data Elements: ttkid[]

The tokids structure and the corresponding array ttkid[] allow me to recompute the convenience items in tblock easily when I add a data element to the MTB. I've set this up so that, when I add new types of data elements in version 2, I can prepare a new structure similar to ttext and add a line to ttkid[] for each one.

Warts

I intentionally set up four white space tokens in ttkid[] so the functions that maintain tblock would get a workout. But, the way I've specified its 1-4 value is klutzy. I've used it as the index to whsp[], the tid value in tokids, and the index to ttkid[]. Before I finished the productions program's last version, I wished I'd used a name (either via enum or #define) for the tid number in tokids. Several times I put the wrong number at a point in my code and then had to climb into the debugger to figure out why things didn't work correctly. I decided to leave these warts here since seeing things to avoid doing is often as helpful as seeing something clever. Also, we'll do them correctly in Part 2.

Declarations

The lines in Figure 2.5 should be familiar from the shell in the last chapter. Since mkxmlp1.c is a single file, I could have made each of its functions static. But I didn't.

------------------------------------------------------------------------- mkxmlp1.c: heading4
121  #include "OBJ/mkxmlp1.ed0"
122  #include "OBJ/mkxmlp1.sd"
123  #include "sperrlog.ed0"
124  #include "spranfio.ed0"
------------------------------------------------------------------------- mkxmlp1.c: heading4
Figure 2.5 Part 4 of mkxmlp1.c's heading: function declarations

Summary

This completes the heading. After I typed in the above lines, I compiled the program until I didn't get any error messages. That got my typos out of the code, but it left ones in my comments.

Section 2.4 Initialization and File Input      [cur chap]   [prev sect]   [next sect]   [next chap]

Earlier in this chapter I described the three tasks that version one needs to perform: load, parse, and write. Actually, there is a fourth task: initialization. Since the load task is straightforward and takes only a few lines, I've decided to discuss it and initialization together. When our program completes them, it will be ready to begin parsing. Figure 2.6 shows the functions mkxmlp1.c uses in this phase.

Figure 2.6 Functions in initialization and load phase of mkxmlp1.c

The following comments should help you read Figure 2.6 and similar figures that appear in the next several chapters:

• If the line surrounding a function's name is solid, the code for the function is in the current section.
• If the line is made up of dots, the function didn't change from an earlier version. The version number, where the function's code can be found, appears as a superscript at the end of the function's name.
• If the line is dashed, the function is actually in a different phase and is described in another section of the current chapter.
• If a line connects two functions, the one on the higher level calls the other one.
• If the line ends with an arrowhead, control never returns to the caller.
• If the line is solid, the call is always made.
• If it is dashed, it is made only if there is some type of error.
• If it has a black dot near one end, the called function returns one or more values to the caller.

These functions contain about a hundred lines of code, but I actually wrote fewer than a dozen lines for this program. I hacked the rest, including all the function names and most of main's organization, from a program I'd written earlier. As we'll see below, code can migrate for years from program to program. With that confession, let's get started. Figure 2.7 shows our new main.

129-144  The Main Routine

If you compare these lines with the ones in mkxmlp0.c, you'll see the new lines are several function calls that do initialization tasks and three calls that do the program's primary tasks: load, parse, and write.

----------------------------------------------------------------------------- mkxmlp1.c: main
129  int main(int argc,char *argv[])
130  {
131       int terr,tnum;
132
133       errlog_init("mkxmlp1",LOG_PID,LOG_LOCAL0);
134       terr = 0;
135       tnum = chkargs(argc,argv);
136       if (tnum != 0)
137            usage();
138       inittf(tnum,argv[1]);
139       ldfile(tf.fname);
140       procfile();
141       mkreport();
142       endprog(terr);
143       return(0);     /* keeps compiler happy */
144  }
----------------------------------------------------------------------------- mkxmlp1.c: main
Figure 2.7 The main routine in mkxmlp1.c

172-186  Check Program Arguments: chkargs

This function performs a 'first level' check of our program's arguments. If one of its checks fails, chkargs calls usage which in turn displays the message in tusage. Remember, the productions program requires two arguments: a file name and a number. This function checks that the name will fit in tf.fname[]. It also checks that the file actually exists and that we can read it. Finally, it returns the number without validating it! In this version the number must be 0, but in future versions other values are possible. I've set things up like this so I can test this function once, now, and then forget about it. It won't change for the rest of Part 1. As our program evolves, main will change. That's where I'll test the number.

-------------------------------------------------------------------------- mkxmlp1.c: chkargs
172  int chkargs(int argc,char *argv[])
173  /* check program arguments: return tnum. NO return if error */
174  {
175       int tnum;
176
177       if (argc != 3)
178            usage();
179       if (strlen(argv[1]) > MAXFNL)
180            error("file name: %s is too long. max length is: %d",argv[1],MAXFNL);
181       if (access(argv[1],R_OK) != 0)
182            error("Cannot open data file: %s. Check its spelling",argv[1]);
183
184       tnum = atoi(argv[2]);
185       return(tnum);
186  }
-------------------------------------------------------------------------- mkxmlp1.c: chkargs
Figure 2.8 The check arguments routine in mkxmlp1.c

Perform Initialization: inittf

Figure 2.9 shows the lines in inittf. They:

1. initialize the global variables in tf, and
2. setup the main token block.

188-197  Global Variables

These assignments are straightforward. When I discover something new that needs to be global, I add it to tf in the heading and initialize it here.

198-207  Setup MTB

These lines implement Figure 2.1 and the discussion that preceded it. The items pd0 and ptok0 correspond to D0 and T0 in that figure.

--------------------------------------------------------------------------- mkxmlp1.c: inittf
188  void inittf(int tnum,char *pfn)
189  /* initialize tfiles data */
190  {
191       char *pt;
192       struct tblock *ptb;
193
194       strcpy(tf.fname,pfn);
195       sprintf(tf.fnchk,"%s.chk",tf.fname);
196       tf.tlen = 0;  /* nothing in buffer to process */
197
198    /* allocate & setup main token block */
199       pt  = myalloc(MLTOKB);
200       ptb = (struct tblock *)pt;
201       tf.ptb     = ptb;
202       ptb->free  = MLTOKB - sizeof(struct tblock);
203       ptb->numt  = 0;
204       ptb->ptok0 = gettokp(0);
205       ptb->pd0   = (void *)(pt + sizeof(struct tblock));
206       ptb->pd    = ptb->pd0;
207  }
--------------------------------------------------------------------------- mkxmlp1.c: inittf
Figure 2.9 The primary initialization routine in mkxmlp1.c

210-218  Allocate a Buffer: myalloc

Every time you request a block of memory, you need to check that the block is valid. Actually, there is nothing special about calls to calloc. A responsible programmer tests every system call that can fail. Part of programming is deciding in advance what to do if a system call does fail. Frequently, all you can do is print a message and quit. Although I've tested the calls to calloc here and to access in chkargs, I've done so differently. The call to access only occurs in chkargs, so I put the test directly in the code. But several different functions need a block of memory. So, I put the calloc test in a convenience function, myalloc, and call it. Using convenience functions promotes clean code that is easy to read and debug.

-------------------------------------------------------------------------- mkxmlp1.c: myalloc
210  char *myalloc(int ts)
211  /* allocate ts bytes & return pointer to area, call error() if error */
212  {
213       char *pt;
214
215       if ((pt = (char *)calloc(ts,1)) == NULL)
216            error("cannot allocate %d bytes",ts);
217       return(pt);
218  }
-------------------------------------------------------------------------- mkxmlp1.c: myalloc
Figure 2.10 The memory allocation routine in mkxmlp1.c

Load a File: ldfile

This function illustrates the power of convenience functions. Recall that we've decided to load the whole input file into a buffer. It's easy to make a list of the steps that will do that:

1. open the file,
2. get its size,
3. use the size to allocate a buffer,
4. read the file into the buffer,
5. close the file, and
6. tidy up.

Each of the first 5 items requires one or more system calls. Writing that code doesn't look like fun. Now, look at the code in Figure 2.11.

220-229  Load the File

The last 5 lines in this section perform the above 5 steps. Boom. When they are done, the file is in a buffer. We're done with input, and we've tested every system call. Each of the rf functions is in spranfio.c. With one exception it's not important for you to understand, now, how they work. We'll build similar functions in Part 2. The exception is the variable trfi that rfopen returns. If you look closely, you'll see that every other rf function has trfi as its first parameter. That is because the routines allow a program to have multiple files open at the same time, and they use a small non-negative integer, trfi, to keep things straight.

230-233  Tidy Up

This function dates back to the early 90s. At that time I was working with ASCII data from several commercial programs. One program put a CTRL-Z (ASCII 26) at the end of its files. That was a pain. These lines test for a CTRL-Z and back up one character if they find one. Sometimes, I process a buffer until I hit a '\0'. Other times, I process a file's actual length. My ldfile routine sets things up so I can do either one.

234-236  Warts

This function really should return both a pointer to the new buffer and its length, but I've hardwired those items here. That doesn't present a problem in the productions program, because it only loads one file.

--------------------------------------------------------------------------- mkxmlp1.c: ldfile
220  void ldfile(char *pfnam)
221  {
222       char *pt,*pt1;
223       int trfi,tlen;
224
225       trfi = rfopen(pfnam,0,-1);
226       tlen = rfsize(trfi);
227       pt = myalloc(tlen + 1);
228       rfldblk(trfi,pt,0,tlen);
229       rfclose(trfi);
230       pt1 = pt + tlen - 1;
231       if (*pt1 != '\x1A')      /* left over from DOS days */
232            pt1++;
233       *pt1 = '\0';
234       tf.pbuf = pt;
235       tf.tlen = tlen;
236  }
--------------------------------------------------------------------------- mkxmlp1.c: ldfile
Figure 2.11 The load file routine in mkxmlp1.c

Summary

I compiled each of these functions after I entered it. That eliminated typos. Also, I verified that each test in chkargs actually worked. At this stage the only place that might have a problem is the MTB that was prepared in inittf. I can't test those lines until I parse.

Section 2.5 Parsing      [cur chap]   [prev sect]   [next sect]   [next chap]

It's now time to begin mkxmlp1.c's real work. Figure 2.12 shows the functions in the parsing phase. The longest and most complicated one is procfile. It moves through the buffer we loaded in the previous section a character at a time and prepares the necessary data elements and tokens.

Figure 2.12 Functions in parse phase of mkxmlp1.c

I didn't just sit down and write procfile. First, I prepared a state diagram that described what I was trying to do. The diagram had two processing states and one goal state, and I drew it using the following rules:

• Begin: Move to state one.
• State one: Read a character. If the character is white space, remember it and move to state two, but reprocess that character in state two. Otherwise, stay in state one. Loop.
• State two: Read a character. If it is the 'remembered' character, stay in state two. Otherwise, move back to state one, but reprocess that character in state one. Loop.
• In either state, if the character is '\0', move to the goal state. Processing is done.

My rules have instructions such as 'Begin' and 'Loop'. They should be clear. Figure 2.13 shows my completed diagram. The above rules contain slightly more information than the actual diagram: namely, the reminder to reprocess the character that caused you to change states. I don't know how to put processing instructions like that in a state diagram.

Figure 2.13 State diagram for parsing ASCII data into text and white space

Parsing: procfile

Generally, I find it straightforward to turn a state diagram into code, but I often find the resulting code confusing to read until I rebuild the logic/diagram behind the code. Figure 2.14 shows procfile. This is the first 'real' code we've seen, and I want to make several additional comments about it.

240-248  Variable Names

Naming variables is a personal matter. Here are some conventions I follow:

• There are two schools of thought about where you should declare a function's local variables:
1. at the beginning of the function, or
2. spread throughout the function. Here, you declare a variable just before you use it.

I always declare all a function's variables at the beginning of the function. That way I know where to look for a declaration if I encounter a variable somewhere in the function and can't remember what it does.

• If a variable is a pointer, I always start its name with a 'p'. Character pointers are quite common. I use 'pt' or 'pbuf' exclusively with them. Also, I try to avoid using 'p' as the first letter of other variables.
• Some variables only take two values. They are effectively switches. I always end their names with 'sw'. Recall that sp0.h (shown in Figure 1.17) defines the values switches use.
• I rarely use typedefs, particularly for the second use that [K&R] state at the end of their section 6.7.

Each of the above items is a matter of style. There is not a correct way to do any of them. But, they describe how I work.

249-256  Initialize, Loop, and Switch

Most of procfile takes place in a switch that has two cases. The switch, in turn, sits in a loop. Before it starts, I need to initialize several variables:

• pt. This line is a good illustration of several comments I just made about variable names. You can look at pt or pbuf and know each is a character pointer. The tf tells you pbuf is a global variable. When the loop starts, pt points to the first character in the buffer we need to parse.
• ptok. This either points directly to a valid token and indirectly to a valid data element (Ti and Di in Figure 2.1) or is NULL. The latter is my clue that I need to prepare a new token and data element in the MTB.
• rcsw. This switch keeps track of whether I need to reprocess the current character or not. The default action is not to reprocess.

There are two ways to terminate processing of a buffer. You can:

1. watch for a special character (often '\0') and quit when you encounter it, or
2. keep track of the number of characters you've processed and quit when that number becomes the buffer's length.

This program uses the first method. Part 2 uses the second method.

257-276  Process Text

This code's large if statement mirrors Figure 2.13's first state.

• Figure 2.15 shows the white space test, wstest. It returns 0-4. A positive value gives the character's position in whsp[].
• The first non-white space character is special: at that point ptok is NULL, and the code sets ptok to the next free token. Remember, text's tid value is 5. Using it instead of something like '#define TEXTTOK 5' is a wart. After the call to getnft, ptok and ptok->pde are both valid pointers, but the text data element corresponding to ptok->pde still needs to be initialized.
• You should verify that after the first regular character the length, ptxt->lt, is 1.

277-296  Process White Space

The large if statement here is functionally equivalent to the previous one with two minor differences:

1. You don't need to call wstest. Instead, you can directly compare the current character, tc, with the current white space character, cws, that was set in case 10.
2. You should use the tid value from case 10 in the call to getnft. Remember, there are four different white space data elements, and tid identifies the correct one.

297-299  Reprocess?

This test is a clean and easy way to reprocess the current character exactly when you need to (i.e., when the state changes).

300-302  Final Messages

These messages are really for debugging. But, I like the information they provide, so I left them in the program.

------------------------------------------------------------------------- mkxmlp1.c: procfile
240  void procfile(void)
241  /* process 1 file */
242  {
243       char *pt,tc,cws;
244       int state,rcsw,tid;
245       struct ttoken *ptok;
246       struct ttext *ptxt;
247       struct twspace *pws;
248
249       pt    = tf.pbuf;
250       ptok  = NULL;
251       state = 10;
252       while ((tc = *pt) != '\0')
253       {
254            rcsw = NO;    /* reprocess character switch: YES or NO */
255            switch(state)
256            {
257                 case 10:       /* have text */
258                      if ((tid = wstest(tc)) > 0)
259                      {
260                           state = 20;
261                           cws = tc;      /* set current ws char */
262                           ptok = NULL;
263                           rcsw = YES;
264                      }
265                      else
266                      {
267                           if (ptok == NULL)
268                           {
269                                ptok = getnft(5);
270                                ptxt = (struct ttext *)ptok->pde;
271                                ptxt->pt = pt;
272                                ptxt->lt = 0;
273                           }
274                           ptxt->lt++;
275                      }
276                      break;
277                 case 20:       /* have white space of type cws */
278                      if (tc == cws)
279                      {
280                           if (ptok == NULL)
281                           {
282                                ptok = getnft(tid);
283                                pws = (struct twspace *)ptok->pde;
284                                pws->tid = tid;
285                                pws->nc  = 0;
286                           }
287                           pws->nc++;
288                      }
289                      else
290                      {
291                           state = 10;
292                           ptok = NULL;
293                           rcsw = YES;
294                      }
295                      break;
296            }
297            if (rcsw == NO)
298                 pt++;
299       }
300       printf("processing of %s complete.\n",tf.fname);
301       printf("tokens: %d, bytes free: %d.\n",tf.ptb->numt,tf.ptb->free);
302  }
------------------------------------------------------------------------- mkxmlp1.c: procfile
Figure 2.14 The parse routine in mkxmlp1.c

309-318  White Space Test: wstest

This function takes a character and returns 0 if the character is not white space, or the character's tid value in ttkid[] if it is. It does that by exploiting the link (a wart, admittedly) between ttkid[] and whsp[]. If you look back to the definition of whsp[] in Figure 2.4, you will see that it contains five characters. The last four correspond to this program's white space characters. Its first character, whsp[0], is unused. In Part 2, we will have several special sets of characters and will need to test whether a character is in one of the sets. We'll use a much better method there to make each test.

--------------------------------------------------------------------------- mkxmlp1.c: wstest
309  int wstest(int tc)
310  /* test for white space. Return 0 if fail or 1-4 for whsp[] & ttkid[] id */
311  {
312       int i;
313
314       for (i = 1; i <= 4; i++)
315            if (tc == whsp[i])
316                 return(i);
317       return(0);
318  }
--------------------------------------------------------------------------- mkxmlp1.c: wstest
Figure 2.15 Test if a character is white space
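
The "much better method" is not shown until Part 2, but one common technique can be sketched here (my own illustration, not necessarily the book's eventual code): a 256-entry lookup table. You mark each member character's slot once at startup, and then every membership test is a single array access, no matter how large the set grows.

```c
#include <string.h>

/* Table-driven set test (illustrative sketch). wstab[c] holds 0 if c
   is not in the set, or the character's id if it is. The " \t\n\r"
   order is my assumption; the ids only need to agree with whsp[1..4]. */
static unsigned char wstab[256];

void wsinit(void)
/* record an id (1-4) for each white space character */
{
     const char *ws = " \t\n\r";
     int i;

     memset(wstab,0,sizeof(wstab));
     for (i = 0; ws[i] != '\0'; i++)
          wstab[(unsigned char)ws[i]] = (unsigned char)(i + 1);
}

int wstest2(int tc)
/* like wstest: 0 if tc is not in the set, else its id */
{
     return(wstab[(unsigned char)tc]);
}
```

The loop in wstest costs up to four comparisons per character; wstest2 costs one load. For a five-character set the difference hardly matters, but the table's cost does not grow with the size of the set.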

320-331  Get Pointer to a Token: gettokp

This routine is deceptive. It looks simple, but it has a subtle twist that is easy to miss. You should see the twist if you ask yourself, "Where does the first token, D0, start?" If you need to, look back to Figure 2.1. The answer is, "Exactly one ttoken width from the end of the MTB." That's the reason tnum is increased by 1.

-------------------------------------------------------------------------- mkxmlp1.c: gettokp
320  struct ttoken *gettokp(int tnum)
321  /* Return pointer to token tnum in the main token block.
322     First token has tnum = 0. */
323  {
324       struct ttoken *ptok;
325       char *pt;
326
327       tnum++;
328       pt   = (char *)tf.ptb;
329       ptok = (struct ttoken *)(pt + (MLTOKB - tnum*sizeof(struct ttoken)));
330       return(ptok);
331  }
-------------------------------------------------------------------------- mkxmlp1.c: gettokp
Figure 2.16 Get the pointer to a particular token in the MTB
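
You can check that arithmetic in isolation. This toy version (my stand-ins: TOYBLK for MLTOKB, struct toytok for struct ttoken, with made-up sizes) shows that token 0 starts exactly one token width below the end of the block, token 1 two widths below, and so on.

```c
/* A toy check of gettokp's pointer arithmetic (illustrative sketch). */
#define TOYBLK 1024

struct toytok { int id; char *pde; };

struct toytok *toy_gettokp(char *pblock,int tnum)
{
     tnum++;     /* the "+1": token 0 lives one width from the end */
     return (struct toytok *)(pblock + (TOYBLK - tnum*sizeof(struct toytok)));
}
```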

Get the Next Free Token: getnft

This function's argument should be a valid index to ttkid[] (i.e., it should correspond to a data element). If it is and there is space, the function prepares and returns the next free token in the MTB. At the start of the function the header, tblock, is consistent. The real work here, which is basically bookkeeping, is to make sure it still is when the function returns.

333-341  Program Defensively

When I write code, I am constantly asking myself, "How can my code fail?" I then try to use that understanding and write code that anticipates potential problems. That's part of "defensive" programming. For example, the test at the end of these lines is not really necessary here, since this version only calls getnft twice. But, future versions will call it more often and from several different functions. So, it's a good idea to verify that the parameter, which I'm about to use as an index to ttkid[], is valid and not off in never-never land.

342-344  Worry About Running Out of Space

The previous test might have been optional, but the one shown here is not. This program will fail if we run out of space in the MTB. This test makes it fail gracefully.

345-352  Allocate & Keep the Books

If you look back at where the MTB was initialized in Figure 2.9, you will see that tf.ptb->pd was set to the first data element (i.e., it is valid at the start of this function). That means that the new token, ptok, can use that value when it sets ptok->pde. But, it also means tf.ptb->pd must be reset. The penultimate instruction does exactly that.

--------------------------------------------------------------------------- mkxmlp1.c: getnft
333  struct ttoken *getnft(int tid)
334  /* init and return next free token for this id. NO return if out of space */
335  {
336       int sr,mid;
337       struct ttoken *ptok;
338
339       mid = sizeof(ttkid)/sizeof(struct ttoken) - 1;
340       if ((tid <= 0) || (tid >= mid))
341            error("undefined tid value (tid = %d) in getnft()",tid);
342       sr = ttkid[tid].sr + sizeof(struct ttoken);
343       if (tf.ptb->free < sr)
344            error("out of space in main token block");
345       ptok = gettokp(tf.ptb->numt);
346       ptok->id  = tid;
347       ptok->pde = tf.ptb->pd;
348       tf.ptb->free -= sr;
349       tf.ptb->numt++;
350       tf.ptb->pd += ttkid[tid].sr;
351       return(ptok);
352  }
--------------------------------------------------------------------------- mkxmlp1.c: getnft
Figure 2.17 Get a pointer to the next unused (free) token in the MTB
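
The bookkeeping can also be modeled on its own. In this miniature (mine, with made-up sizes and offsets instead of real pointers), data elements grow up from the low end of the block, tokens grow down from the high end, and free is the gap between the two regions; every allocation shrinks the gap by one token plus one data element, exactly as getnft does.

```c
/* A miniature model of getnft's bookkeeping (illustrative sketch). */
#define TOYLEN 256

struct toymtb {
     int numt;      /* tokens allocated so far            */
     int free;      /* bytes left between the two regions */
     int pd;        /* offset of the next free data byte  */
};

int toy_getnft(struct toymtb *b,int desize,int toksize)
/* try to allocate one token plus a desize-byte data element;
   return 0 on success, -1 when out of space */
{
     if (b->free < desize + toksize)
          return(-1);
     b->free -= desize + toksize;
     b->numt++;
     b->pd   += desize;
     return(0);
}
```

The invariant worth noticing: allocation fails only when the gap is too small to hold both halves of the next token, so the two regions can never overlap.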

Summary

This program's parsing is modest. It will greatly change in the next version. But, the MTB will not change. True, it will hold additional types of data elements, but ttkid[]'s layout allows new ones to be easily added. More importantly, our last functions-wstest, gettokp, and getnft-will not change. After I debugged them in this program, I was able to forget about them.

Section 2.6 The Output Phase-Writing Reports      [cur chap]   [prev sect]   [next sect]   [next chap]

Unfortunately, it turns out to be a good deal more complicated to get material out of our program than it was to load our original file. One reason for the extra complexity is that it's not really possible to predict how big our new file will be. So, we can't allocate a single buffer for it. Instead, we have to buffer our output. A second reason is that this phase is where the real work of our program takes place. These two reasons correspond to the two kinds of functions in this phase. We have:

1. a small number of general purpose functions that any dual-publishing program might use. They are designed to handle all aspects of moving data to the output buffer and writing, as necessary, that buffer to the output file.
2. application specific functions. These do the application's work. Periodically, they prepare a chunk of text that needs to get written to the output file. They don't know how the buffering works. All they know is that if they call a function and include the data they need to write as one of its arguments then, later, the new material shows up in the proper form in the new file.

Figure 2.18 shows the functions in this phase. Only mkreport is an application function. The other new ones are general purpose.

Figure 2.18 Functions in report phase of mkxmlp1.c

Writing Reports: mkreport

Remember, all we're trying to do in version one is take the parsed data that we prepared in the previous section and rebuild the input file. You should be able to sit down and prepare a list of steps to do that. Here is my to-do list. We need to:

1. open our output file,
2. take a token and print its corresponding data element,
3. repeat the previous step for all the tokens, and
4. close the file.

356-368  Open and Setup

These lines declare our variables, open the report file, and initialize certain variables. They use two global items in tf (see Figure 2.3):

1. tf.fnchk. This holds the name of the report file. We prepared it in inittf (Figure 2.9). If you look closely at the call to rfopen here and the one in ldfile (Figure 2.11) you will see the parameters are different. There, we wanted to open the file and then read it. That call would generate a fatal error if the file didn't exist. That's why we checked earlier in chkargs (Figure 2.8) that the file did exist. Here, it doesn't matter if the output file exists or not. In fact, if it does, we need to truncate it so that we can build a new one.
2. tf.wbuf. This is our output buffer. Its length is WBLEN. I've copied its address and length into pt0 and ml because they appear on every call to several of the general purpose functions (e.g., mfillbuf and mfillbufc, below), and I want each call to fit, if possible, on one line.

There are also two other items:

1. trfi. Remember, after we open a file, this becomes our handle to it. When we get ready to write something, it is required.
2. fi. This "remembers" where we are in the buffer. When we add material to it, this is the index we need to fill first.

369-387  Process Each Token

These lines perform the loop that corresponds to steps 2 and 3 on my to-do list. After we get a token, we use its id value and recast its generic pointer so it points to the proper data element. That means we need to add to our buffer either:

• a white space character, or
• a chunk of text.

In the first case we know how many times the character occurs. In the second we know where the text starts and its length. The two general purpose functions, mfillbufc and mfillbuf, were designed to handle precisely those cases. If their arguments appear intimidating, don't panic. I'll discuss them in a moment when we look at mfillbuf. Note: the first test I made in getnft guarantees that only the five cases listed here can occur.

388-390  Cleanup and Close

I hope you don't have the bad habit of allowing the operating system to clean up after you and close your files. It will. In fact, it will even flush each file's buffer before it does. But, it can't flush tf.wbuf because it doesn't know that we are using it as a buffer for tf.fnchk. So, we need to flush it ourselves and then close the file. That's what these lines do.

------------------------------------------------------------------------- mkxmlp1.c: mkreport
356  void mkreport(void)
357  /* control output phase */
358  {
359       char *pt0,tc;
360       struct ttoken *ptok;
361       struct ttext *ptxt;
362       struct twspace *pws;
363       int i,ml,fi,trfi;
364
365       trfi = rfopen(tf.fnchk,1,-1);
366       pt0 = tf.wbuf;
367       ml  = WBLEN;
368       fi  = 0;
369       for (i = 0; i < tf.ptb->numt; i++)
370       {
371            ptok = gettokp(i);
372            switch(ptok->id)
373            {
374                 case 1:        /* white space */
375                 case 2:
376                 case 3:
377                 case 4:
378                      pws = (struct twspace *)ptok->pde;
379                      tc  = whsp[pws->tid];
380                      fi  = mfillbufc(pt0,fi,ml,tc,pws->nc,trfi);
381                      break;
382                 case 5:        /* text */
383                      ptxt = (struct ttext *)ptok->pde;
384                      fi = mfillbuf(pt0,fi,ml,ptxt->pt,ptxt->lt,trfi);
385                      break;
386            }
387       }
388       linkrfw(trfi,pt0,fi);    /* flush left over data */
389       rfclose(trfi);
390  }
------------------------------------------------------------------------- mkxmlp1.c: mkreport
Figure 2.19 mkreport controls the output phase in mkxmlp1.c

Write Our Buffer to a File: linkrfw

This function is our link to one of rf's output functions. That function, rfwrite, requires three things:

• trfi, the file's handle,
• pbuf, the start of the buffer to write, and
• tlen, the number of characters in the buffer to write.

They become linkrfw's parameters. If you look back to the call to linkrfw in Figure 2.19, you will see its arguments were: trfi, pt0, and fi-a handle, a pointer, and a length. A valid trfi is non-negative. Also, there is no need to write something if the length is zero. This function silently traps those cases and only passes its arguments on to rfwrite when it knows there is something for rfwrite to do.

-------------------------------------------------------------------------- mkxmlp1.c: linkrfw
394  void linkrfw(int trfi,char *pbuf,int tlen)
395  /* link to rfwrite. This allows trfi = -1 & tlen = 0 */
396  {
397       if ((trfi < 0) || (tlen < 1))
398            return;
399       rfwrite(trfi,pbuf,tlen);
400  }
-------------------------------------------------------------------------- mkxmlp1.c: linkrfw
Figure 2.20 linkrfw writes our buffer to a file

Add Text to a Buffer: mfillbuf

Before we look at this function's code, I want to review the problems we need it to solve:

• we have a buffer that is partially filled with data,
• we have some text whose length we know,
• we want to add the text to the buffer, and
• we want to write the buffer to a file whenever it gets full.

We will need the following variables to do the above tasks:

1. a pointer to the main buffer,
2. the maximum number of characters the buffer can hold,
3. where we are in the buffer,
4. the file's handle,
5. a pointer to the start of our text, and
6. the length of the text.

It's tempting to treat several of those variables, perhaps the first four, as global and put them in tf. That would be a mistake, because in version 3 we will need to prepare several output files. We might even work on two at the same time. Global variables would quickly become unwieldy. Instead, every time we call mfillbuf, we must pass it the above pieces of information. When it is done, it returns where it is in the buffer. We store that value and use it on our next call.
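
That calling convention can be seen in miniature. In this sketch (my own; a flush just counts bytes instead of calling linkrfw and rfwrite), the caller owns the buffer and threads fi through each call, exactly as mkreport does:

```c
#include <string.h>

/* A toy version of the mfillbuf calling convention (illustrative
   sketch). The caller passes the buffer, its fill index, and its
   length on every call, and stores the returned index for next time.
   Flushes count bytes here, standing in for linkrfw(trfi,pbuf,fi). */
static int flushed;      /* bytes "written" so far */

int toy_fillbuf(char *pbuf,int fi,int ml,const char *pt,int ml1)
{
     int tl;

     if (ml1 == 0)
          ml1 = (int)strlen(pt);
     for (tl = 0; tl < ml1; )
     {
          if (fi < ml)
               pbuf[fi++] = pt[tl++];
          else
          {
               flushed += fi;     /* buffer full: "write" it */
               fi = 0;
          }
     }
     return(fi);     /* first index to use on the next call */
}
```

Because all the state travels in the arguments and the return value, a program can keep two (or ten) output files going at once simply by keeping a separate (buffer, fi) pair for each.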

402-412  Declaration and Comments

These lines repeat the above discussion. If you read them closely, you'll discover one subtle twist that I've not yet mentioned.

413-419  Tests

The first test catches arguments that shouldn't occur, but if one does, we get out quickly. The second is the above subtlety. Version one doesn't use these lines, but all future versions will. So, I've included them here. One of this chapter's exercises should help you see how to exploit this feature.

420-433  Copy the Text

This loop copies the correct number of characters from our text to the output buffer. Before it adds each character, it checks to see if the buffer is full. If it is, the function writes the buffer (via linkrfw) and resets its index. The counter, tl, is only increased when a character is actually moved.

The value mfillbuf returns is always the first index it should use the next time it is called.

------------------------------------------------------------------------- mkxmlp1.c: mfillbuf
402  int mfillbuf(char *pbuf,int fi,int ml,char *pt,int ml1,int trfi)
403  /* progressively fill a buffer with a string:
404       pbuf - buffer to fill.
405       fi   - start at pbuf[fi].
406       ml   - max length of pbuf (so pbuf[ml-1] is last non null char filled).
407       pt   - string to add.
408       ml1  - 0, pt ends in \0 so add to there;
409             >0, use at most ml1 chars in pt.
410     Put '\0' at end of buffer & return position of \0.
411     As necessary, write to trfi. */
412  {
413       int tl;
414
415       if ((pt == NULL) || (ml1 < 0))
416            return(fi);
417
418       if (ml1 == 0)
419            ml1 = strlen(pt);
420       tl = 0;
421       while (tl < ml1)
422       {
423            if (fi < ml)
424                 pbuf[fi++] = pt[tl++];
425            else
426            {
427                 linkrfw(trfi,pbuf,fi);
428                 fi = 0;
429            }
430       }
431       pbuf[fi] = '\0';
432       return(fi);
433  }
------------------------------------------------------------------------- mkxmlp1.c: mfillbuf
Figure 2.21 mfillbuf copies text to the output buffer

437-461  Add a Character to a Buffer: mfillbufc

All the parameters here are identical to the ones in mfillbuf with one exception, a character replaces the pointer to the text. This code is almost identical to mfillbuf's and shouldn't present any difficulties.

------------------------------------------------------------------------ mkxmlp1.c: mfillbufc
437  int mfillbufc(char *pbuf,int fi,int ml,char tc,int ml1,int trfi)
438  /* Progressively fill a buffer with a character:
439       parameters same as mfillbuf() except for char tc.
440     Put '\0' at end of buffer & return position of \0.
441     As necessary, write to trfi. */
442  {
443       int tl;
444
445       tl = 0;
446       while (tl < ml1)
447       {
448            if (fi < ml)
449            {
450                 pbuf[fi++] = tc;
451                 tl++;
452            }
453            else
454            {
455                 linkrfw(trfi,pbuf,fi);
456                 fi = 0;
457            }
458       }
459       pbuf[fi] = '\0';
460       return(fi);
461  }
------------------------------------------------------------------------ mkxmlp1.c: mfillbufc
Figure 2.22 mfillbufc adds a character to the output buffer

Summary

Two of the above functions, linkrfw and mfillbufc, will not change for the rest of the productions program. Also, mfillbuf won't change until version 5.

Section 2.7 Summary      [cur chap]   [prev sect]   [next sect]   [next chap]

Version one is complete. Figure 2.23 shows my results when I used mkxmlp1's source code as its input file. If those results are viewed from outside our program, they don't look impressive. But, if you look under the hood, as we've done, you will see what we've accomplished. Our MTB works. We should be able to add new data elements to it easily. Also, the output routines that do our buffering are almost done and won't need attention for several versions. Our program is growing.

In version 2 we'll focus on parsing and make it handle start and end tags, and attribute names and values. That means the program will no longer work with just any ASCII file. In version 3 we'll assume the XML data is made up of productions and begin to make reports appropriate for them.

--------------------------------------------------------------------
~/doc/xml/.prules -->$mkxmlp1 mkxmlp1.c 0
processing of mkxmlp1.c complete.
tokens: 3391, bytes free: 53060.
~/doc/xml/.prules -->$ ll mkxmlp1.*
-rw-rw-r--  1 david  david  12144 Feb  6 15:16 mkxmlp1.c
-rw-rw-r--  1 david  david  12144 Feb  6 19:20 mkxmlp1.c.chk
~/doc/xml/.prules -->$cmp mkxmlp1.c mkxmlp1.c.chk
~/doc/xml/.prules -->$
--------------------------------------------------------------------
Figure 2.23 Results of a sample run of mkxmlp1 when its input file is its source file

Exercises

2.1   Express each item in struct tblock in terms of numt and MLTOKB. Your answer (for each item) should be a macro or a function.

2.2   Version one does not use two items in struct tblock. Which two?

2.3   Suppose I had designed the MTB so its tokens began immediately after the header and worked up the block, and its data elements began at the end of the block and worked down. What functions would I have needed to do differently? Make the necessary changes to them. Do you see why I put the tokens last?

2.4   What are the advantages in defining whsp[] the way I did in Figure 2.4 instead of as `whsp[] = " \t\n\r "'? The latter definition puts the four white space characters in ASCII order.

2.5   When can fi be zero in the call to linkrfw in mkreport?

2.6   Suppose you are using mfillbuf in a program that prepares an HTML file. Write one line of code, using mfillbuf, that will make the first line in your file.

2.7   If mfillbuf passes its first test, can it ever return 0?

[summary]   [detail]   [first chap]   [cur chap]