    XHTML5 White-space Handling Algorithm

    So I have a server now and a placeholder page for my site. I'm currently working on an XML-based template system to serve my pages. Naturally, the XML templates need to be human-readable, so they will be neatly indented. However, the output is only ever going to be used by a browser to render markup, so in the process of applying the template to produce the output page, I'd like to strip out as much white-space as possible without altering how the page is rendered.

    So far, I think I've got a pretty good algorithm. If you'd take a look at the source code of my current site's main page…
    HTML Code:
    <?xml version="1.0" encoding="utf-8" standalone="no"?> <!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:xml="http://www.w3.org/XML/1998/namespace" xmlns:svg="http://www.w3.org/2000/svg"><head><meta charset="utf-8"/><title>KenUn.Li</title><style>/* General */ .icon { max-height: 1.5em; max-width: 1.5em; } svg.graphicastext { max-height: 1em; } /* Page Layout */ html, body { height: 100%; } body { margin: 0; padding: 0; }</style><style>/* Page Theme */ body { background-attachment: scroll; background-color: hsl(0, 0%, 96.875%); background-clip: border-box; border-style: none; color: black; font-family: Calibri, Tahoma, Geneva, Verdana, sans-serif; font-size: medium; }</style></head><body><table id="doccontainer"><tr><td/><td id="navarea"><nav><ul><li><a title="Home" href="http://kenun.li/" onclick="javascript:window.alert(&quot;You’re already here, silly!&quot;);return false;"><img class="icon" src="http://r.kenun.li/media/images/icons/sitearea/home.svg" alt=""/> Home</a></li></ul></nav></td><td/></tr><tr><td id="footer" colspan="3"><footer><p>KenUn.Li Copyright © 2011 Kevin Li</p></footer></td></tr></table></body></html>
    (The code was cut down in length for brevity.)



    The problem is that the white-space culling algorithm doesn't know when it can get away with stripping all of the white-space and when it can't. For example, it would be perfectly fine to strip all the white-space between and within tags inside of <head></head>, but not inside of <pre></pre>. (I explicitly declared xml:space="preserve" on those tags to keep my algorithm from clobbering the content.) Currently, except inside elements marked with xml:space="preserve", every run of white-space characters between sibling elements and within text nodes is reduced to a single space. All other white-space characters are deleted outright. In summary:
    PHP Code:
    $text = preg_replace('/[\n\r\t ]{2,}/', ' ', $textNode->wholeText);
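    As a quick standalone check of what that pattern does and does not touch, here is a small sketch. The sample string is made up for illustration. Because `{2,}` only matches runs of two or more characters, a lone tab or newline passes through unchanged; the second pattern, using `+`, is an assumed variant (not the code in use) that normalises those as well.

```php
<?php
// Hypothetical sample input resembling indented template text.
$sample = "Hello,\n    world!\tBye";

// Pattern from the post: only runs of 2+ white-space characters collapse.
$collapsed = preg_replace('/[\n\r\t ]{2,}/', ' ', $sample);

// Assumed variant: every run, even a single tab or newline, becomes one space.
$normalised = preg_replace('/[\n\r\t ]+/', ' ', $sample);

var_dump($collapsed);   // the lone tab before "Bye" is still present
var_dump($normalised);  // all white-space is now plain single spaces
```

    Whether a lone newline should become a space depends on whether the output needs to be byte-minimal or merely render the same, since a browser treats either character as collapsible white-space outside of pre-formatted content.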
    Now I'm rewriting the code (the entire site, in fact, since the template system was inadequate), and I want to make this text node culling algorithm smarter by having it treat white-space the way a standard browser would. I've got a few ideas already:
    • Keep all of the white-space characters in textarea, pre, and possibly other elements I don't know about.
    • Within inline elements, reduce all white-space characters to a single space character.
    • White-space characters not immediately within an inline element are deleted.
    • White-space characters immediately after a start tag or immediately before an end tag are deleted.
    • Within block elements, delete all white-space characters.
    Any advice before I implement this?
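
    To make the list above concrete, here is a minimal sketch of how those rules might be dispatched on the name of the element that directly contains each text node. Everything in it is an assumption for illustration: the function names `whitespaceMode` and `cullByMode` are hypothetical, the element lists are deliberately incomplete, and "delete all white-space within block elements" is read as "drop white-space-only text nodes and trim at the tag boundaries".

```php
<?php
// Hypothetical, incomplete name lists; a real implementation needs fuller ones.
const PRESERVE_ELEMENTS = ['pre', 'textarea'];
const INLINE_ELEMENTS   = ['a', 'abbr', 'b', 'code', 'em', 'i', 'span', 'strong'];

/** Classify the element that directly contains a text node. */
function whitespaceMode(DOMElement $parent): string
{
    $name = strtolower($parent->localName);
    if (in_array($name, PRESERVE_ELEMENTS, true)) {
        return 'preserve';
    }
    return in_array($name, INLINE_ELEMENTS, true) ? 'inline' : 'block';
}

/** Apply the rule for one text node; null means "remove the node". */
function cullByMode(DOMText $node): ?string
{
    $text = $node->data;
    switch (whitespaceMode($node->parentNode)) {
        case 'preserve':
            return $text; // keep every character
        case 'inline':
            return preg_replace('/[\n\r\t ]+/', ' ', $text); // collapse runs
        default:
            // Assumed reading of the block rule: white-space-only nodes go away.
            if (trim($text) === '') {
                return null;
            }
            $text = preg_replace('/[\n\r\t ]+/', ' ', $text);
            // Trim only where the text touches the parent's start or end tag.
            if ($node->previousSibling === null) {
                $text = ltrim($text);
            }
            if ($node->nextSibling === null) {
                $text = rtrim($text);
            }
            return $text;
    }
}

// Usage on a made-up fragment.
$doc = new DOMDocument();
$doc->loadXML("<div>\n  <p>\n    Hello   <em>big\n world</em>\n  </p>\n  <pre>  keep\n  me </pre>\n</div>");
$xpath = new DOMXPath($doc);
// Snapshot the node list first, since nodes are removed while iterating.
foreach (iterator_to_array($xpath->query('//text()')) as $node) {
    $new = cullByMode($node);
    if ($new === null) {
        $node->parentNode->removeChild($node);
    } else {
        $node->data = $new;
    }
}
$result = $doc->saveXML($doc->documentElement);
echo $result, "\n";
```

    Taken literally, the block rule in this sketch would also drop a white-space-only node sitting between two inline elements inside a block, which joins words that a browser would render with a space between them. Whether to special-case that is part of the open question.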

    EDIT: Here's my terrible code from the previous batch of PHP files:
    PHP Code:
    /*
     *  TO DO:
     *  1. Disable under certain circumstances. For example, sometimes a template:condition may evaluate to nothing, leaving two white-space characters in place of what should be one.
     */
    function cullTextNode($textNode) {
        // Leave the text alone when the parent element asks for it.
        if ($textNode->parentNode->getAttribute('xml:space') == 'preserve') {
            return $textNode->wholeText;
        }
        else {
            $previousSibling = $textNode->previousSibling;
            $nextSibling = $textNode->nextSibling;
            // White-space-only nodes between elements are dropped entirely.
            if ($textNode->isWhitespaceInElementContent())
                return null;
            else {
                $text = preg_replace('/[\n\r\t ]{2,}/', ' ', $textNode->wholeText);
                // nodeType 1 is XML_ELEMENT_NODE: keep the space on a side
                // that borders an element, trim it otherwise.
                if ($previousSibling == null)
                    if ($nextSibling == null)
                        return trim($text);
                    else
                        if ($nextSibling->nodeType == 1)
                            return ltrim($text);
                        else
                            return trim($text);
                else
                    if ($previousSibling->nodeType == 1)
                        if ($nextSibling == null)
                            return rtrim($text);
                        else
                            if ($nextSibling->nodeType == 1)
                                return $text;
                            else
                                return rtrim($text);
                    else
                        if ($nextSibling == null)
                            return trim($text);
                        else
                            if ($nextSibling->nodeType == 1)
                                return ltrim($text);
                            else
                                return trim($text);
            }
        }
    }
    It even ravages the text within script and style elements indiscriminately (which I should have put in a CDATA section anyway).
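
    One possible way to guard against that, sketched under assumptions: skip text whose parent is a script or style element, and look for xml:space on every ancestor rather than only on the direct parent, since in XML the attribute applies to an element's descendants until it is overridden. The helper name `mustPreserve` and the `RAW_TEXT_ELEMENTS` list are hypothetical, not part of the code above.

```php
<?php
// Assumption: elements whose text content should never be altered.
const RAW_TEXT_ELEMENTS = ['script', 'style'];

/**
 * Hypothetical helper: true if white-space in $textNode must be left alone,
 * either because the parent is a script/style element or because the nearest
 * ancestor that declares xml:space says "preserve".
 */
function mustPreserve(DOMText $textNode): bool
{
    $parent = $textNode->parentNode;
    if ($parent instanceof DOMElement
        && in_array(strtolower($parent->localName), RAW_TEXT_ELEMENTS, true)) {
        return true;
    }
    // Walk up the tree; the loop stops at the document node.
    for ($el = $parent; $el instanceof DOMElement; $el = $el->parentNode) {
        $value = $el->getAttribute('xml:space'); // '' when the attribute is absent
        if ($value !== '') {
            return $value === 'preserve';        // nearest declaration wins
        }
    }
    return false; // nothing declared anywhere up the tree
}

// Usage on a made-up fragment.
$doc = new DOMDocument();
$doc->loadXML('<body><pre xml:space="preserve"><b> a  b </b></pre>'
            . '<p> c </p><style> p { } </style></body>');

$inPre   = $doc->getElementsByTagName('b')->item(0)->firstChild;
$inP     = $doc->getElementsByTagName('p')->item(0)->firstChild;
$inStyle = $doc->getElementsByTagName('style')->item(0)->firstChild;

var_dump(mustPreserve($inPre), mustPreserve($inP), mustPreserve($inStyle));
```

    The text inside <b> is protected here even though <b> itself carries no attribute, which the parent-only check in the old function would miss.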
    Last edited by LCS; 01-27-2012 at 01:51 AM.