Monday 26 May 2008 by

For those that just went… “Tag what?!”, the tag parser is the part of TXP that interprets the TXP tags in your carefully crafted forms and pages.

Although we generally call the entire thing a “tag parser”, the actual parsing can be divided into three separate parts, all of which have been improved since TXP 4.0.6:

  1. tag parser: finds the TXP tags in the forms and pages (and articles if preferred).
  2. attribute parser: detects the attribute key/value pairs specified in those TXP tags.
  3. if/else parser: needed to correctly handle the <txp:else /> “tag”.

This is the first of two articles discussing the parser changes in the upcoming TXP 4.0.7 version (no, not next week/month). The second article will discuss parsing speed.

Tag nesting

Similar to (X)HTML, TXP allows you to nest tags, for (not so great) example:

<txp:if_category>
  <txp:if_section>
    section and category
  <txp:else />
    only category
  </txp:if_section>
<txp:else />
  only section
</txp:if_category>

That works fine, even in older TXP versions, as long as the nested tag is not the same as the enclosing tag. Nesting a tag within the same tag, as shown in the example below, wasn’t possible:

<txp:if_section name="article">
  My Articles
<txp:else />
  <txp:if_section name="about">
    About this website
  <txp:else />
    Um, something else.
  </txp:if_section>
</txp:if_section>

The tag parser would detect the first </txp:if_section> tag as the closing tag for the first <txp:if_section> tag, which broke the parsing of the second <txp:if_section> tag and in turn caused a warning about an unknown <txp:else /> tag. This was a limitation of the old parser, which was very fast and efficient but relied purely on a single regular expression. While regular expressions are generally the greatest thing since chocolate-covered raisins and apple pie, they are not good at parsing nested structures.

Enter the new parser. It no longer has any limitations on tag nesting, so the second example given above now works fine. And as before, there are no limits on nesting depth.
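
To make the difference concrete, here is a minimal Python sketch. It is not TXP's actual PHP code, and the function names regex_body and nested_body are made up for this illustration. It looks for the contents of the outer if_section tag from the second example, first with a single non-greedy regular expression and then with a simple depth counter. The sketch assumes the tag is only used as a container, never in self-closing form.

```python
import re

TEMPLATE = (
    '<txp:if_section name="article">'
    'My Articles'
    '<txp:else />'
    '<txp:if_section name="about">'
    'About this website'
    '<txp:else />'
    'Um, something else.'
    '</txp:if_section>'
    '</txp:if_section>'
)


def regex_body(text, tag):
    """Single-regex approach: stops at the FIRST closing tag it sees."""
    pattern = re.compile(r'<txp:%s\b[^>]*>(.*?)</txp:%s>' % (tag, tag), re.S)
    match = pattern.search(text)
    return match.group(1) if match else None


def nested_body(text, tag):
    """Depth-counting approach: keeps track of nested openers of the same tag."""
    token = re.compile(r'<txp:%s\b[^>]*>|</txp:%s>' % (tag, tag))
    depth = 0
    start = None
    for m in token.finditer(text):
        if m.group(0).startswith('</'):
            depth -= 1
            if depth == 0:
                # This closing tag belongs to the outermost opener.
                return text[start:m.start()]
        else:
            if depth == 0:
                start = m.end()  # body begins right after the outer opener
            depth += 1
    return None  # no balanced pair found
```

With this template, regex_body(TEMPLATE, 'if_section') returns a body that ends at "Um, something else." and leaves the inner tag unclosed, while nested_body(TEMPLATE, 'if_section') returns the complete body, including the inner closing tag.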

Attribute value escaping

Most TXP tags allow you to specify attributes as key/value pairs to override default behaviour. The attribute values must always be delimited by a pair of double (or single) quotes:

<txp:tag key="value" />

Suppose you wanted to use this as an attribute value: a “good” example:

<txp:tag key="a "bad" example" />

You can probably see the problem: the attribute parser would assume that the attribute value ended at the second quotation mark, setting the value for the key to “a “, ignoring the rest of the attribute value.

In the past you could only solve this by using single quotes to delimit the attribute value:

<txp:tag key='a "good" example' />

While this works for this particular situation, it wasn’t a perfect solution, because it didn’t account for attribute values that contain both single and double quotes. It also failed to handle attribute values containing a >, because before the attribute parser had a chance to deal with it, the tag parser had already interpreted it as the end of a TXP tag.

The new parser solves this by being aware of attributes during tag parsing and treating a duplicate delimiter character inside an attribute value as a literal character instead of a delimiter. The following examples all work in the new parser:

<txp:tag key="a ""quoted"" word" />

That would set the ‘key’ attribute to contain the value: a “quoted” word.

<txp:tag key='a ''quoted'' word' />

That would set the ‘key’ attribute to contain the value: a ‘quoted’ word.
If you look closely, you’ll see that in the attribute value, there are two single quotes on each side of the word ‘quoted’, not a double quote.

One last example that shows what’s possible now:

<txp:tag key="let's use ""double"" quotes & <html> here" />

So the only character that needs escaping is the delimiter (double quote here).
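
The escaping rule can be sketched in a few lines of Python. This is only an illustration of the doubled-delimiter idea, not the real TXP attribute parser; the function name parse_attributes is hypothetical, and the sketch assumes well-formed, quoted key/value pairs separated by whitespace.

```python
def parse_attributes(source):
    """Parse key="value" pairs; a doubled delimiter inside a value is a literal."""
    attrs = {}
    i, n = 0, len(source)
    while i < n:
        # Skip whitespace between pairs.
        while i < n and source[i].isspace():
            i += 1
        # Read the key up to '=' or whitespace.
        start = i
        while i < n and source[i] not in '= \t\r\n':
            i += 1
        key = source[start:i]
        if not key or i >= n or source[i] != '=':
            break  # not a key=value pair; the sketch simply stops here
        i += 1
        if i >= n or source[i] not in '"\'':
            break  # unquoted values are out of scope for this sketch
        quote = source[i]
        i += 1
        chars = []
        while i < n:
            if source[i] == quote:
                if i + 1 < n and source[i + 1] == quote:
                    chars.append(quote)  # doubled delimiter: literal quote
                    i += 2
                    continue
                i += 1  # single delimiter: end of the value
                break
            chars.append(source[i])
            i += 1
        attrs[key] = ''.join(chars)
    return attrs
```

Feeding it the attribute part of the examples above, for instance parse_attributes('key="a ""quoted"" word"'), gives a dictionary in which each doubled delimiter has been collapsed to a single literal quote, while the other quote type, & and < pass through untouched.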

Attribute value parsing

In most cases, you want attribute values to be treated as just a string of text, but there are situations where it can be useful to parse the attribute value itself. Given the popularity of the asy_wondertag plugin, we’ve enabled attribute value parsing for single quoted attribute values.

Double quoted attribute values are not parsed, so if your attribute value contains something that looks like a TXP tag but should be treated as literal text, you must use double quotes. In fact, you should use double quotes to delimit attribute values at all times, unless you want the attribute value parsed. The reason for this is simple: speed. Parsing an attribute value is slower than treating it as plain text.

What does this all mean? Well, let’s give a few examples, starting with attribute values that are not parsed:

<txp:tag key="plain text" />
<txp:tag key="literal <txp:tag />" />

In the above examples, the attribute values are treated as plain text; the literal TXP tag is not parsed. If you wanted the TXP tag in the attribute value to be parsed, you should write it like this:

<txp:tag key='parsed <txp:tag />' />

Let’s take a real-world example, using an article that has a custom field named ‘email’ containing an email address me@example.com and a custom field ‘name’ containing my name:

<txp:email
  email='<txp:custom_field name="email" />'
  linktext="Send email"
  title='Send email to <txp:custom_field name="name" />'
/>

Because the single quoted attribute values are parsed, after parsing the attribute values, it looks like this:

<txp:email 
  email="me@example.com" 
  linktext="Send email"
  title="Send email to Ruud"
/>

If it were just one article, you wouldn’t need attribute parsing, but if you have many articles with different email addresses in such an ‘email’ custom field, this can be very useful.
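
The rule "single quotes are parsed, double quotes are literal" comes down to a check on the delimiter. The Python sketch below illustrates that decision only. The CUSTOM_FIELDS dictionary and the functions parse_tags and attribute_value are hypothetical stand-ins, and the toy parse_tags expands nothing but <txp:custom_field />, whereas the real parser handles any TXP tag.

```python
import re

# Hypothetical article data standing in for TXP custom fields.
CUSTOM_FIELDS = {'email': 'me@example.com', 'name': 'Ruud'}


def parse_tags(text):
    """Toy stand-in for the tag parser: only expands <txp:custom_field />."""
    def expand(match):
        return CUSTOM_FIELDS.get(match.group(1), '')
    return re.sub(r'<txp:custom_field name="([^"]*)" />', expand, text)


def attribute_value(raw, quote):
    """Single-quoted values are parsed; double-quoted values stay literal."""
    if quote == "'":
        return parse_tags(raw)
    return raw  # no parsing needed, which is also the faster path
```

Calling attribute_value with the title value from the example and a single quote as delimiter expands the custom field, while the same text passed with a double quote comes back unchanged.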

Attribute value parsing has no real limitations. Within a parsed attribute value, you can:

  • have an unlimited number of TXP tags.
  • mix plain text with TXP tags.
  • use container tags (yes, even <txp:php>), self-closing tags and if/else constructs.
  • only for die-hard users: even the attributes of tags inside an attribute can be parsed to unlimited depth, provided you use proper attribute value quoting and escaping. Now if only someone could find a practical use for this…

Deprecated use and backwards incompatibility

Contrary to earlier parser versions, attribute values must now always be delimited by double (or single) quotes. Non-quoted attribute values (<txp:tag wrong=non-quoted />) are deprecated but still work for now. They trigger a warning in debug/testing mode so that you can fix them and avoid problems in future versions, which will no longer accept non-quoted attribute values.

Multiple attribute key/value pairs must be separated by whitespace.
Wrong: <txp:tag key1="value"key2="value" />
Good: <txp:tag key1="value" key2="value" />

Backslash stripping (PHP stripslashes) is no longer performed on attribute values. It had no practical use anyway and few people will notice this, because it only affects backslashes. If you want to use a backslash in an attribute value, there’s no longer a need to insert it twice. Once is enough.
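
For those who want to see the backslash change spelled out, here is a rough Python imitation. It only approximates PHP's stripslashes (it ignores that function's special handling of \0), and the names stripslashes_like, old_attribute_value and new_attribute_value are invented for this sketch.

```python
import re


def stripslashes_like(value):
    """Rough imitation of PHP stripslashes: drop a backslash, keep what follows."""
    return re.sub(r'\\(.?)', lambda m: m.group(1), value, flags=re.S)


def old_attribute_value(raw):
    # Behaviour before the change, as described above: backslashes are stripped.
    return stripslashes_like(raw)


def new_attribute_value(raw):
    # Behaviour after the change: the value is left untouched.
    return raw
```

Under the old behaviour a single backslash disappeared and only a doubled one survived as a single backslash; under the new behaviour the value comes through exactly as written.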