Stripping images out of an HTML string in PHP

This is some wacky stuff, but I needed it for this blog so I didn't have to work super duper hard to reduce the front page size.

Basically, my blog posts contain images, and stripping them from the main page so you don't have to load 30 MB of images is imperative. I was doing this with an ugly regex, but it was buggy and didn't let me do arbitrary processing on each image node.

So I did what any sane person does and wrote a processing function to traverse the post DOM. I author all of my posts in Markdown; the upside is that the source is easy to read and it gets turned into super clean HTML. The blog software does the heavy lifting of normalizing each post into pure HTML for me, and I can run code on the output.

For this, I used PHP's DOMDocument API, which is horrible, don't get me wrong, but it's significantly better than any other path I found.

The Code

<?php
function contents($contents, $stripImages) {
    // create a document obj
    $document = new DOMDocument();

    // we don't care about whitespace
    $document->preserveWhiteSpace = false;

    // load html with some flags set since we don't have some of the necessary tags
    $document->loadHTML($contents, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // find all of our images
    $image_nodes = $document->getElementsByTagName("img");

    // loop through them *backwards* in case we need to delete any
    for($pos=$image_nodes->count() - 1; $pos>=0; $pos-=1) {
        // save the real image to a var, we'll use that later
        $real_image = $image_nodes->item($pos);

        // clone it to get a var we can work with
        $img = $real_image->cloneNode(true);

        // pull the src attribute
        $src = $img->getAttribute("src");

        // and save it in case we need to change it (thumbnails)
        $og_src = $src;

        // pull the alt attribute
        $alt = $img->getAttribute("alt");

        // create a container
        $image_container = $document->createElement("span");
        $image_container->setAttribute("style", "display:inline-block;");

        // check for thumbnail
        if (strtolower($alt) === "thumb" || substr(strtolower($alt),0,6) === "thumb:") {
            // for thumbnails, remove the "thumb" text and pull out the actual alt text
            if (strtolower($alt) === "thumb") {
                $alt = '';
            } else {
                $alt = mb_substr($alt, 6);
            }

            // explode the link on slashes, add "thumbnails/" to the last item, re-implode it
            $path = explode("/", $src);
            $path[count($path)-1] = "thumbnails/" . $path[count($path)-1];
            $src = implode("/", $path);

            // create the parent link
            // use it instead of the container we made above
            $image_container = $document->createElement("a");
            $image_container->setAttribute("href", $og_src);
        }

        // set our src and alt in case we changed them.
        $img->setAttribute("src", $src);
        $img->setAttribute("alt", $alt);

        // add the image tag to our containing element
        $image_container->appendChild($img);

        // set tooltip info
        $image_container->setAttribute("t-pos", "b");
        $image_container->setAttribute("tip-size", "medium");
        $image_container->setAttribute("aria-label", $alt);

        if ($stripImages) {
            // if we're stripping the images, create a text node instead
            $image_container = $document->createTextNode("[Image: " . $img->getAttribute("alt") . "]");
        }

        // replace the real image with the new container, which may just be a text node
        $real_image->replaceWith($image_container);
    }

    // return a string of the document structure
    return $document->saveHTML();
} 
?>
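
The backwards loop matters because the list returned by getElementsByTagName() is live: it shrinks as nodes are removed or replaced. Here is a minimal sketch of the difference, using made-up helper names (make_doc, remove_forward, remove_backward) and a made-up three-image snippet. None of it is part of the function above.

```php
<?php
// Hypothetical helpers for illustration; only the DOM calls are real PHP APIs.
function make_doc(): DOMDocument {
    $doc = new DOMDocument();
    $doc->loadHTML(
        '<div><img src="a.png"><img src="b.png"><img src="c.png"></div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    return $doc;
}

// Forward loop: each removal shifts the remaining items down by one,
// so the index skips over an image on the next iteration.
function remove_forward(DOMDocument $doc): int {
    $imgs = $doc->getElementsByTagName('img');
    for ($i = 0; $i < $imgs->length; $i++) {
        $node = $imgs->item($i);
        $node->parentNode->removeChild($node);
    }
    // number of images left in the document
    return $doc->getElementsByTagName('img')->length;
}

// Backward loop: removing the last item never changes the
// position of the items we still have to visit.
function remove_backward(DOMDocument $doc): int {
    $imgs = $doc->getElementsByTagName('img');
    for ($i = $imgs->length - 1; $i >= 0; $i--) {
        $node = $imgs->item($i);
        $node->parentNode->removeChild($node);
    }
    return $doc->getElementsByTagName('img')->length;
}

echo remove_forward(make_doc()), "\n";
echo remove_backward(make_doc()), "\n";
```

With three images, the forward loop should leave one behind (it removes the first and third and never reaches the second), while the backward loop removes all of them.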

There are some things specific to this blog in there that you'd want to remove if you use this elsewhere, such as the thumbnail linking code and the tooltip code, but overall you could drop this into a lot of things and it should "just work". It also needs PHP 8, since replaceWith() doesn't exist on DOM nodes before that. It's not very error-resilient (because my blog generates HTML that's Good Enough (tm)), but fixing that is just a matter of adding a check here or there.
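
As an example of the kind of check meant here, this is a rough sketch of a loader that guards against empty input and keeps libxml's parse warnings out of the page output. The name load_post_html is hypothetical, and which checks you actually need depends on what your HTML looks like.

```php
<?php
// Hypothetical helper, not part of the original function.
// Returns a DOMDocument, or null if there is nothing usable to parse.
function load_post_html(string $html): ?DOMDocument {
    // loadHTML() refuses an empty string (PHP 8 throws a ValueError),
    // so bail out early on empty or whitespace-only input.
    if (trim($html) === '') {
        return null;
    }

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $loaded = $document->loadHTML(
        $html,
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    // Drop the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $loaded ? $document : null;
}
```

The caller can then return the original string untouched whenever this gives back null, instead of letting the whole page fail on one odd post.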

One major caveat is that, in some circumstances, the DOM library will mangle your HTML. For example, on blog posts starting with a blockquote, it would wrap the entire contents of the post in that blockquote. It's strange behavior, but wrapping the input to loadHTML in <div>(stuff)</div> fixed it.
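
The mangling appears to be a side effect of LIBXML_HTML_NOIMPLIED: with no implied html/body wrapper, the parser seems to treat the first element as the document root and nests everything else inside it. A minimal sketch of the wrapper workaround follows, with hypothetical helper names (load_wrapped, save_unwrapped). It also strips the extra div back off when saving, which is an assumption about what you'd want rather than something the function above does.

```php
<?php
// Hypothetical helpers sketching the <div> wrapper workaround.
function load_wrapped(string $html): DOMDocument {
    $document = new DOMDocument();
    // The wrapper div gives the parser a single root element,
    // so the first real element of the post is not promoted to root.
    $document->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    return $document;
}

function save_unwrapped(DOMDocument $document): string {
    // Serialize each child of the wrapper div, leaving the div itself out.
    $out = '';
    foreach ($document->documentElement->childNodes as $child) {
        $out .= $document->saveHTML($child);
    }
    return $out;
}

$doc = load_wrapped('<blockquote>quote</blockquote><p>text</p>');
echo save_unwrapped($doc), "\n";
```

The paragraph should come out as a sibling of the blockquote rather than a child of it, and the wrapper div should not appear in the output.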
