Regular Expressions
Stepan Zubashev, 2012-06-13 11:33:18

What should be the regular expression to replace relative links with absolute ones (RSS)?

Good day. I really love regular expressions, but I often get lost in them when the task is less than trivial. And now... I have an RSS feed, and my task is to replace all relative paths in it with absolute ones. I came up with the following solution:

static public function relative_to_absolute( $text )
  {
    $rg =
      '#'.
        '(<\w+\s.*)'. // "<img "
        '(href|src)\s*=\s*'. // "src = "
        '(?:'.
          '\'([^\']+)\''.'|'. // 'relative'
          '\"([^\"]+)\"'. // "relative"
        ')'.
      '#';

    $replace = $text;
    $index = Kohana::$base_url !== '/';

    do
    {
      $text = $replace;

      $replace = preg_replace_callback( $rg, function( $m ) use ( $index )
      {
        $url = empty( $m[ 3 ] ) ? $m[ 4 ] : $m[ 3 ];

        if( strpos( $url, '//' ) === false )
        {
          return $m[ 1 ].$m[ 2 ].'="'.URL::site( $url, 'http', $index ).'"';
        }
        else
        {
          return $m[ 0 ];
        }
      }, $replace );
    }
    while( $text !== $replace );

    return $replace;
  }

It works, but I don't like it for the following reasons:
1. The attributes are hard-coded. This isn't critical, but I'd like a more universal solution. The problem is how to distinguish a relative URL from an ordinary attribute value.
2. The part of the pattern that matches the quoted link is duplicated, because the quotes can be either ' or ". If you write something like '|" instead, what happens to links that contain a quote (for example, an apostrophe produced by transliterating "ь")?
3. The regex matches every URL, and only inside the callback do I decide which ones are relative and which are not. How can absolute URLs be filtered out at the preg_replace stage? I suspect a negative lookbehind is needed, but I don't understand how to fit it in.
In general, it seems to me that this task can be solved without _callback. I'm not interested in a ready-made solution so much as in understanding how it works :)
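
For readers unfamiliar with the two kinds of check mentioned in point 3, here is a minimal sketch of the difference. Both helper functions are hypothetical and exist only for illustration. A lookahead inspects the text after the current position, a lookbehind inspects the text before it. As far as I know, PCRE has traditionally required a lookbehind pattern to have a known, bounded length, which makes it awkward for protocols of varying length.

```php
<?php
// Hypothetical helper, not part of the code above: true when a double-quoted
// attribute value does not start with "scheme://" or "//".
function looks_relative(string $quoted): bool
{
    // (?![\w:]*//) is a negative lookahead: it checks the text that FOLLOWS
    // the opening quote without consuming it, and fails on a protocol prefix.
    return preg_match('#^"(?![\w:]*//)[^"]+"$#', $quoted) === 1;
}

// Hypothetical helper: true when the text contains "//" that is NOT preceded
// by "http:". (?<!http:) is a negative lookbehind: it checks what comes BEFORE.
function has_bare_double_slash(string $text): bool
{
    return preg_match('#(?<!http:)//#', $text) === 1;
}

var_dump(looks_relative('"/img/a.png"'));          // bool(true)
var_dump(looks_relative('"http://example.com/"')); // bool(false)
var_dump(has_bare_double_slash('//cdn/a.js'));     // bool(true)
var_dump(has_bare_double_slash('http://x/a.js'));  // bool(false)
```

Since the protocol comes after the opening quote, the lookahead form is the one that fits naturally right behind the quote in the pattern.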


2 answer(s)
Stepan Zubashev, 2012-06-13
@faiwer

JavaScript's regular expression support is very limited: only the bare minimum. I know about negative lookbehind, I mentioned it myself above :) I still haven't figured out how to apply it here. But I came up with a solution that uses a negative lookahead. It turned out like this:

  static public function relative_to_absolute( $text )
  {
    $rg =
      '#'.
        '(<\w+\s.*)'. // "<img "
        '(href|src)\s*=\s*'. // "src = "
        '(?:'.
          '\'(?![\w:]+//)/?([^\']+)\''.'|'. // 'relative'
          '\"(?![\w:]+//)/?([^\"]+)\"'. // "relative"
        ')'.
      '#';

    $replace = $text;
    $host = self::base('http');

    do
    {
      $text = $replace;
      // $3 is filled for single-quoted URLs, $4 for double-quoted ones; the unused group is empty
      $replace = preg_replace( $rg, '$1$2="'.$host.'$3$4"', $replace );
    }
    while( $text !== $replace );

    return $replace;
  }

(?![\w:]+//) asserts that no protocol follows the opening quote,
/? skips a leading slash if there is one, because $host already ends with one.
The only problem left is the duplicated pattern for single and double quotes. But here, to be honest, it seems to me that there is no straightforward solution :)
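
One possible way around the duplicated quote handling is a backreference: capture the opening quote in a group and require the same character as the closing quote. The sketch below is a standalone variant under several assumptions of mine. The `$host` parameter stands in for `self::base('http')` and is assumed to end with a slash. The tag prefix `(<\w+\s.*)` is dropped, so the do/while loop is no longer needed, at the cost of also matching attributes outside of tags. The lookahead uses `*` instead of `+`, so protocol-relative `//host/...` URLs are left alone too.

```php
<?php
// Hypothetical standalone variant of relative_to_absolute(); $host is assumed
// to be the site base URL with a trailing slash, e.g. "http://example.com/".
function relative_to_absolute(string $text, string $host): string
{
    $rg =
        '#'.
            '\b(href|src)\s*=\s*'. // group 1: attribute name, e.g. "src = "
            '(["\'])'.             // group 2: opening quote, either kind
            '(?![\w:]*//)'.        // not followed by "scheme://" or "//"
            '/?'.                  // skip a leading slash, $host already has one
            '((?:(?!\2).)+)'.      // group 3: any characters except the opening quote
            '\2'.                  // closing quote must equal the opening one
        '#';

    // ${2} re-emits the original quote, so a URL containing the other kind
    // of quote stays valid.
    return preg_replace($rg, '$1=${2}'.$host.'${3}${2}', $text);
}

echo relative_to_absolute(
    '<img src="/a.png"> <a href=\'b/c\'>x</a> <a href="http://x.com/">y</a>',
    'http://example.com/'
);
// <img src="http://example.com/a.png"> <a href='http://example.com/b/c'>x</a> <a href="http://x.com/">y</a>
```

Because group 3 only excludes the quote character that actually opened the value, a double-quoted link may still contain an apostrophe, and the reverse.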

Anatoly, 2012-06-13
@taliban

var index = "http://index.com"; "<img src='/relative'>".replace(/src='(\/.*?)'/, "src='" + index + "$1'")

I think it will not be difficult to translate the code from JavaScript =)
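
A rough PHP equivalent of that one-liner might look like the sketch below. The function name `absolutize_src` is hypothetical, and the closing quote is kept in the replacement. Like the JavaScript version, it only handles a single-quoted `src` whose value starts with a slash.

```php
<?php
// Hypothetical helper: prefixes single-quoted, slash-leading src values with $index.
function absolutize_src(string $html, string $index): string
{
    // (/.*?) lazily captures the path up to the closing single quote;
    // "\$1" keeps the dollar sign literal so preg_replace sees the reference $1.
    return preg_replace("#src='(/.*?)'#", "src='" . $index . "\$1'", $html);
}

echo absolutize_src("<img src='/relative'>", 'http://index.com');
// <img src='http://index.com/relative'>
```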
