Regex Corner: Extract image tags from HTML

Regular expressions are an invaluable development tool, and also extremely handy for non-developers who need to comb through plain text in an editor. In this article, we'll look at a simple regex problem and dissect a possible solution.

The Problem

We are using Drupal Migrate to import content from another CMS. Much of the incoming content is unstructured markup, and we are trying to automate as much additional structure into the resulting Drupal site as possible (with hand-editing likely to follow).

We notice that the content usually (but not always) begins with a photo of the author, in an image tag. We'd like to extract this from the text and place it in an actual media field in Drupal, so we can apply image styles and so forth.

To accomplish this, we create a custom process plugin for Migrate that accepts the HTML and, depending on a configuration variable, returns either the HTML without the image tag (to place in the new body area) or just the image source (for creating the media element).

The Text

<h2>A Man's Reach Should Exceed His Grasp</h2>

<img src="http://www.ldonline.org/images/people/henrywinkler.jpeg" border="1" alt="Henry Winkler" /> 

<p>So as I&#146;m reading the narration into a tape recorder, it started to dawn on me. I&#146;m not lazy. I&#146;m not stupid. I&#146;m dyslexic..."</p>

The Regex

A single regular expression will handle both of our use cases.

<img[^>]*src="([^"]+)"[^>]*>

Let's break that down a bit.

First, the regular expression looks for the literal string <img. This finds the beginning of an image tag.

Then, we want to skip ahead until we locate the image's src attribute. This is where we do some trickery. Often when skipping on to the next "special" bit of the text, we think of the catchall .* which matches any character, zero or more times. But quantifiers are "greedy" by default: .* gobbles up as much text as possible and only gives back what it must. If several image tags fall within reach of .* (which by default stops at line breaks), the match would run from the first <img to the src of the last one, swallowing everything in between, which is no good. So instead, we use [^>]* to grab zero or more characters that are not a >. It is still greedy, but it can never run past the end of the current tag.
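To see the difference concretely, here is a minimal sketch in PHP. The two-image sample string and the variable names are made up for illustration and do not come from the migrated content.

```php
<?php
// Hypothetical sample: two image tags on one line.
$html = '<img src="a.jpg" alt="A"> some text <img src="b.jpg" alt="B">';

// Greedy catchall: .* runs to the end of the line, then backtracks
// only as far as the LAST occurrence of src=".
preg_match('%<img.*src="%', $html, $greedy);

// Negated class: [^>]* cannot cross the > that ends the first tag.
preg_match('%<img[^>]*src="%', $html, $bounded);

echo $greedy[0], "\n";   // spans both tags
echo $bounded[0], "\n";  // stays inside the first tag
```

The first pattern matches everything from the first <img up to the second tag's src=", while the second stops at the first tag's own src attribute.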

Next is another easy part: the literal string src=" to begin the src attribute.

Inside capturing parentheses, we have another string of characters. This time, we grab one or more characters that are not quotes, so the capture stops at the src attribute's closing quote instead of greedily running on to the quotes of later attributes.
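A short sketch of why the quote-excluding class matters, using the image tag from the sample text. The .+ variant is a deliberately wrong alternative shown only for contrast, and the variable names are illustrative.

```php
<?php
// The image tag from the sample text.
$tag = '<img src="http://www.ldonline.org/images/people/henrywinkler.jpeg" border="1" alt="Henry Winkler" />';

// Greedy .+ inside the quotes runs on to the last quote in the tag.
preg_match('%src="(.+)"%', $tag, $greedy);

// [^"]+ stops at the first closing quote.
preg_match('%src="([^"]+)"%', $tag, $bounded);

echo $greedy[1], "\n";   // URL plus the border and alt attributes
echo $bounded[1], "\n";  // just the URL
```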

We wrap up by closing the quote, once again matching everything that doesn't end the tag, and the final literal > character.
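Putting the pieces together, the following sketch runs the complete expression against the sample text with preg_match(). The paragraph is shortened here for space, and % is used as the pattern delimiter so the expression itself needs no escaping.

```php
<?php
// Sample text from above (paragraph shortened).
$html = <<<'HTML'
<h2>A Man's Reach Should Exceed His Grasp</h2>

<img src="http://www.ldonline.org/images/people/henrywinkler.jpeg" border="1" alt="Henry Winkler" />

<p>So as I&#146;m reading the narration into a tape recorder...</p>
HTML;

$regexp = '%<img[^>]*src="([^"]+)"[^>]*>%';

if (preg_match($regexp, $html, $matches)) {
  echo $matches[0], "\n";  // the whole image tag
  echo $matches[1], "\n";  // only the captured src value
}
```

Index 0 of the match array holds the entire tag, and index 1 holds whatever the capturing parentheses grabbed.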

Putting it in context

The full Migrate plugin looks like this:

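<?php

// The module name "my_module" is a placeholder; use your own module's namespace.
namespace Drupal\my_module\Plugin\migrate\process;

use Drupal\migrate\MigrateExecutableInterface;
use Drupal\migrate\MigrateSkipProcessException;
use Drupal\migrate\ProcessPluginBase;
use Drupal\migrate\Row;

/**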
 * Finds the first image tag inside text.
 *
 * Available configuration keys:
 * - remove: Whether to remove the tag and return the surrounding text,
 *   rather than returning the image's src attribute.
 *
 * Examples:
 *
 * @code
 * process:
 *   field_story_text:
 *     plugin: extract_first_image
 *     remove: true
 * @endcode
 *
 * @MigrateProcessPlugin(
 *   id = "extract_first_image"
 * )
 */
class ExtractFirstImage extends ProcessPluginBase {

  /**
   * {@inheritdoc}
   */
  public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property) {
    $regexp = '%<img[^>]*src="([^"]+)"[^>]*>%';

    // Treat a missing 'remove' key as false instead of raising a notice.
    if (!empty($this->configuration['remove'])) {
      return preg_replace($regexp, '', $value, 1);
    }
    else {
      if (preg_match($regexp, $value, $matches)) {
        return $matches[1];
      }
      throw new MigrateSkipProcessException('No image found.');
    }
  }

}

When remove is true, preg_replace() strips out the whole matched image tag; its limit argument of 1 means only the first image is removed. When remove is false, we use the capturing parentheses to yank out just the src attribute and return that, or skip the process if the text contains no image.
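To try both modes without a Drupal installation, the same logic can be sketched as a plain function. The function name extract_first_image(), the sample markup, and the NULL return (standing in for the skip exception) are assumptions made for this illustration, not part of the plugin.

```php
<?php
/**
 * Hypothetical stand-in for the plugin's transform() logic, without Drupal.
 *
 * Returns NULL where the plugin would throw MigrateSkipProcessException.
 */
function extract_first_image(string $html, bool $remove): ?string {
  $regexp = '%<img[^>]*src="([^"]+)"[^>]*>%';
  if ($remove) {
    // A limit of 1 strips only the first image tag.
    return preg_replace($regexp, '', $html, 1);
  }
  return preg_match($regexp, $html, $matches) ? $matches[1] : NULL;
}

// Made-up input: an author photo followed by a body with an inline image.
$html = '<img src="author.jpg" alt="Author"><p>Body with <img src="inline.png"> inside.</p>';

$body = extract_first_image($html, TRUE);   // markup minus the first image
$src = extract_first_image($html, FALSE);   // src of the first image
$none = extract_first_image('<p>No image here.</p>', FALSE);

echo $body, "\n";
echo $src, "\n";
var_dump($none);
```

Note that the inline image further down the body survives the removal, since only the first match is replaced.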

With that, we're done!