So I have a server now and a placeholder page for my site. I'm currently working on an XML-based template system to serve my pages. Naturally, the XML templates need to be human-readable and will be neatly indented. However, the output is only ever going to be used by a browser to render markup, so in the process of applying the template to produce the output page, I'd like to strip out as much white-space as possible without altering how the page is rendered.
So far, I think I've got a pretty good algorithm. If you'd take a look at the source code of my current site's main page…
(The code was cut down in length for brevity.)
HTML Code:
<?xml version="1.0" encoding="utf-8" standalone="no"?> <!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:xml="http://www.w3.org/XML/1998/namespace" xmlns:svg="http://www.w3.org/2000/svg"><head><meta charset="utf-8"/><title>KenUn.Li</title><style>/* General */ .icon { max-height: 1.5em; max-width: 1.5em; } svg.graphicastext { max-height: 1em; } /* Page Layout */ html, body { height: 100%; } body { margin: 0; padding: 0; }</style><style>/* Page Theme */ body { background-attachment: scroll; background-color: hsl(0, 0%, 96.875%); background-clip: border-box; border-style: none; color: black; font-family: Calibri, Tahoma, Geneva, Verdana, sans-serif; font-size: medium; }</style></head><body><table id="doccontainer"><tr><td/><td id="navarea"><nav><ul><li><a title="Home" href="http://kenun.li/" onclick="javascript:window.alert(&quot;You’re already here, silly!&quot;);return false;"><img class="icon" src="http://r.kenun.li/media/images/icons/sitearea/home.svg" alt=""/> Home</a></li></ul></nav></td><td/></tr><tr><td id="footer" colspan="3"><footer><p>KenUn.Li Copyright © 2011 Kevin Li</p></footer></td></tr></table></body></html>
The problem lies in the fact that the white-space culling algorithm doesn't know when it can get away with stripping all of the white-space and when it can't. For example, it would be perfectly fine to strip all the white-space between and within tags inside of <head>…</head>, but not inside of <pre>…</pre>. (I explicitly declared xml:space="preserve" for those tags to keep my algorithm from clobbering the content.) Currently, all sequences of white-space characters are reduced to one single space in between sibling elements and within text nodes with the exception of elements marked by xml:space="preserve". All other white-space characters are deleted outright. In summary:
PHP Code:
$text = preg_replace('/[\n\r\t ]{2,}/', ' ', $textNode->wholeText);
Now I'm rewriting the code (the entire site, since the template system was inadequate), and I want to make this text node culling algorithm smarter by having it treat white-space the way a standard browser would. I've got a few ideas already:
- Keep all of the white-space characters in textarea, pre, and possibly other elements I don't know about.
- Within inline elements, reduce all white-space characters to a single space character.
- White-space characters not immediately within an inline element are deleted.
- White-space characters immediately after a start tag and before a closing tag are deleted.
- Within block elements, delete all white-space characters.
Any advice before I implement this?
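To make the ideas above a bit more concrete, here's a rough sketch of how they might look in code. Everything in it is an assumption on my part: the function name cullByParent, the two element lists (which are nowhere near exhaustive), and my reading of the rules, namely that edge trimming only applies outside of inline elements. Unlike the one-liner I use now, it collapses every run of white-space, including a lone tab or newline.

```php
<?php
// Hypothetical, incomplete element lists; extend them as needed.
const PRESERVE_ELEMENTS = ['pre', 'textarea'];
const INLINE_ELEMENTS = ['a', 'abbr', 'b', 'code', 'em', 'i', 'span', 'strong'];

function cullByParent(DOMText $textNode): string
{
    $parent = $textNode->parentNode;
    $name = $parent instanceof DOMElement ? strtolower($parent->localName) : '';

    // Rule 1: leave pre/textarea content exactly as written.
    if (in_array($name, PRESERVE_ELEMENTS, true)) {
        return $textNode->wholeText;
    }

    // Collapse every run of white-space into a single space.
    $text = preg_replace('/[\n\r\t ]+/', ' ', $textNode->wholeText);

    // Rule 2: inside inline elements, collapsing is all we do.
    if (in_array($name, INLINE_ELEMENTS, true)) {
        return $text;
    }

    // Rules 3 and 5: white-space-only text outside inline elements goes away.
    if (trim($text) === '') {
        return '';
    }

    // Rule 4: trim right after the start tag and right before the end tag.
    if ($textNode->previousSibling === null) {
        $text = ltrim($text);
    }
    if ($textNode->nextSibling === null) {
        $text = rtrim($text);
    }
    return $text;
}
```

The space between a text node and a neighbouring inline element is kept on purpose, since removing it would glue words together in the rendered page.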
EDIT: Here's my terrible code from the previous batch of PHP files. It even ravages the text within script and style elements indiscriminately (which I should have put in a CDATA section anyway).
PHP Code:
/*
 * TO DO:
 * 1. Disable under certain circumstances. For example, sometimes a template:condition may evaluate to nothing, leaving two whitespaces in place of what should be one.
 */
function cullTextNode($textNode) {
    // Leave the text untouched when the parent element asks for it.
    if ($textNode->parentNode->getAttribute('xml:space') == 'preserve') {
        return $textNode->wholeText;
    }
    // Drop text nodes that contain nothing but white-space.
    if ($textNode->isWhitespaceInElementContent()) {
        return null;
    }
    // Collapse runs of two or more white-space characters into one space.
    // (A lone tab or newline is left as it is.)
    $text = preg_replace('/[\n\r\t ]{2,}/', ' ', $textNode->wholeText);
    $previousSibling = $textNode->previousSibling;
    $nextSibling = $textNode->nextSibling;
    // Keep white-space only on a side that borders an element node;
    // trim it at the parent's edges and next to non-element nodes.
    if ($previousSibling == null || $previousSibling->nodeType != XML_ELEMENT_NODE) {
        $text = ltrim($text);
    }
    if ($nextSibling == null || $nextSibling->nodeType != XML_ELEMENT_NODE) {
        $text = rtrim($text);
    }
    return $text;
}
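One thing the function above gets wrong is that it only looks at the direct parent for xml:space, even though that attribute is meant to apply to everything inside the element until a nested element overrides it. Here's a minimal sketch of a guard that walks up the ancestors instead and also spares script and style. The name shouldPreserveWhitespace and the element list are my own assumptions, not something from the old code.

```php
<?php
// Hypothetical list of elements whose text should never be culled.
const RAW_TEXT_ELEMENTS = ['script', 'style', 'pre', 'textarea'];
const XML_NAMESPACE_URI = 'http://www.w3.org/XML/1998/namespace';

function shouldPreserveWhitespace(DOMNode $node): bool
{
    // Walk from the nearest ancestor element up to the root.
    for ($el = $node->parentNode; $el instanceof DOMElement; $el = $el->parentNode) {
        if (in_array(strtolower($el->localName), RAW_TEXT_ELEMENTS, true)) {
            return true;
        }
        // The nearest explicit xml:space declaration wins.
        $space = $el->getAttributeNS(XML_NAMESPACE_URI, 'space');
        if ($space === 'preserve') {
            return true;
        }
        if ($space === 'default') {
            return false;
        }
    }
    // No declaration and no raw-text ancestor: culling is allowed.
    return false;
}
```

The idea would be to call this at the top of the culling function and return the text unchanged whenever it says true.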


