At a Glance: DOM Parsing

The problem: You have some markup that needs to be modified. Maybe it needs extra attributes? Maybe it needs some added markup if certain conditions are met? Whatever the case, the problem sounds easy in principle but can become quite complicated depending on the task. Many new developers fall into the trap of taking the easy way out (jQuery) or using the wrong tool (Regular Expressions) for a job like this. But what's the harm in using something like jQuery to fix your markup problem? Why are Regular Expressions so damned evil when it comes to navigating the DOM tree? Allow me to use a real-world example that a StackOverflow user ran into a day ago:

I've got a weird layout to get around and am at a loss, even in the planning stage. Essentially I need to separate out all content that's not a .gallery and put it into an <aside />. I initially considered a plugin using the edit_post hook from the Plugin API, but have since decided against it because this content change is layout specific and I want to maintain a clean database. So... How can I parse through WP's the_content for content that's not .gallery? [...] here's an example of WP's the_content class output:
HTML
<div class="entry-content">
<div class="gallery">
<dl class="gallery-item">
    <dt class="gallery-icon portrait"><img class="attachment-thumbnail" src="/imagePath/etc.jpg" /></dt>
</dl>

<dl class="gallery-item">
    <dt class="gallery-icon portrait"><img class="attachment-thumbnail" src="/imagePath/etc.jpg" /></dt>
</dl>

<dl class="gallery-item">
    <dt class="gallery-icon portrait"><img class="attachment-thumbnail" src="/imagePath/etc.jpg" /></dt>
</dl>
</div>

<p>Curabitur vulputate, ligula lacinia scelerisque tempor, lacus lacus ornare ante, ac egestas est urna sit amet arcu. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Sed molestie augue sit amet.</p>

<ul>
    <li>Item A</li>
    <li>Item B</li>
    <li>Item C</li>
</ul>
</div>
Desired Output
<div class="entry-content">
<div class="gallery">
<dl class="gallery-item">
    <dt class="gallery-icon portrait"><img class="attachment-thumbnail" src="/imagePath/etc.jpg" /></dt>
</dl>

<dl class="gallery-item">
    <dt class="gallery-icon portrait"><img class="attachment-thumbnail" src="/imagePath/etc.jpg" /></dt>
</dl>

<dl class="gallery-item">
    <dt class="gallery-icon portrait"><img class="attachment-thumbnail" src="/imagePath/etc.jpg" /></dt>
</dl>
</div>

<aside>
<p>Curabitur vulputate, ligula lacinia scelerisque tempor, lacus lacus ornare ante, ac egestas est urna sit amet arcu. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Sed molestie augue sit amet.</p>

<ul>
    <li>Item A</li>
    <li>Item B</li>
    <li>Item C</li>
</ul>
</aside>
</div>

This single question covers the entire scope of tools a developer will be tempted to use when confronted with a problem like this. So going back to my earlier questions, why would JavaScript/jQuery and Regex be BAD for solving this?

  1. When incorrect data is generated on the server side, it's bad practice to attempt to correct it on the client side.
  2. The desired result involves wrapping content in an <aside> tag. This HTML5 tag has SEO benefits and helps give crawlers context. If you fix a server-side markup problem with client-side tools, a crawler that doesn't execute JavaScript never sees the fix, which can hurt your search engine rankings.
  3. Regular Expressions are not bullet-proof. If you're using Regex to parse HTML, YOU'RE DOING IT WRONG. A simple explanation (though also misguided in its own right) is that Regular Expressions can only match Regular Languages, which HTML is not. A better (albeit more complex) answer can be found here.
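To make the third point concrete, here is a minimal sketch of how a naive pattern breaks as soon as the gallery contains a nested div, while a DOM query does not care. The sample markup, the "caption" class and the variable names are made up for illustration and are not part of the original question.

```php
<?php
// Illustrative markup (an assumption, not the question's markup):
// the gallery div contains a nested div plus a paragraph.
$html = '<div class="entry-content">'
      . '<div class="gallery"><div class="caption">One</div><p>Two</p></div>'
      . '<p>Outside</p>'
      . '</div>';

// Naive regex: capture everything between the opening tag and the first </div>.
preg_match('~<div class="gallery">(.*?)</div>~s', $html, $m);
$regex_inner = $m[1];
// The lazy match stops at the nested div's closing tag, so <p>Two</p> is lost.

// DOM approach: the gallery is a node, and the parser already knows its children.
$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$xpath   = new DOMXPath($dom);
$gallery = $xpath->query('//div[@class="gallery"]')->item(0);

$dom_inner = '';
foreach ($gallery->childNodes as $child) {
    $dom_inner .= $dom->saveHTML($child); // serialize each child node
}

echo "Regex saw: ", $regex_inner, "\n";
echo "DOM saw:   ", $dom_inner, "\n";
```

A greedy pattern fails in the opposite direction, running on to the last closing div in the document. Either way the pattern has no notion of nesting, which is exactly what the parser gives you for free.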

But I'm going off-topic here. There are some wonderful articles on why parsing HTML with Regular Expressions is a bad idea. You'll also find a good bit of laughter in watching the very idea drive a StackOverflow user to the brink of madness. But let's go back to the problem at hand. Using PHP's built-in DOM parser will save you a lot of heartache, but understand that there's also a bit of a learning curve. I invite you to read the documentation on DOMDocument, but in the meantime, I'll break down the answer to the above problem:

add_filter('the_content', 'wrap_nongallery_aside', 20); // Listen for the_content
function wrap_nongallery_aside($content) { // Our filter callback
    $dom = new DOMDocument(); // Create an instance of DOMDocument
    libxml_use_internal_errors(true); // Collect libxml parse warnings instead of printing them
    $dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); // Load our content's HTML without an added doctype or <html>/<body> wrapper (PHP 5.4+)
    libxml_clear_errors(); // Discard the collected warnings
    $xpath = new DOMXPath($dom); // Use XPath to navigate the tree
    $container = $xpath->query('//div[@class="entry-content"]')->item(0); // Find the entry-content div
    if (!$container) return $content; // No entry-content div in this content: leave it untouched

    $not_gallery = $xpath->query('./*[not(contains(concat(" ", normalize-space(@class), " "), " gallery "))]', $container); // Grab all top-level elements inside the entry-content div that do not have a class of "gallery"
    $aside = $dom->createElement('aside'); // Create an aside element
    foreach ($not_gallery as $ng) {
        $aside->appendChild($ng); // Move each result into the aside element
    }
    $container->appendChild($aside); // Append the aside element to the entry-content div
    return $dom->saveHTML(); // Return the modified HTML
}
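The same toolkit covers the simpler case mentioned at the start, where the markup just needs an extra attribute. Below is a minimal sketch that runs outside WordPress. The function name add_missing_alt and the sample markup are hypothetical, and it assumes PHP 7 or later with the DOM extension and that the input has a single root element.

```php
<?php
// Hypothetical helper (not from the original answer): give every <img>
// that lacks an alt attribute an empty one.
function add_missing_alt(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    // Parse the fragment without adding a doctype or <html>/<body> wrapper.
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('img') as $img) {
        if (!$img->hasAttribute('alt')) {
            $img->setAttribute('alt', ''); // only touch images missing the attribute
        }
    }

    return $dom->saveHTML();
}

// Sample input (an assumption): one image without alt, one with.
$out = add_missing_alt(
    '<div class="gallery"><img src="/imagePath/etc.jpg"><img src="/b.jpg" alt="B"></div>'
);
echo $out;
```

The shape is the same as the filter above: load, locate the nodes, change them, serialize. Only the middle step differs from task to task.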

In short: Parsing HTML is a solved problem. Using the correct tools for this job will make your life a LOT easier.

Written by maiorano84 on Wednesday February 18, 2015