Help me write a regular expression to replace the first occurrence of a word in HTML, but strictly outside of anchors and attributes?

PHP

Alexey, 2012-05-30 23:33:46

Friends, I need a regular expression that finds and replaces, in text marked up as HTML:

the first occurrence of a given word
that is not wrapped inside an <a>…</a> tag (i.e. does not fall inside the anchor text of any link)
and that is not part of any attribute (such as the alt attribute of an img tag)

Below is sample text for testing. The replacement should happen only in the fragment "... every cucumber, just now ..." (the first sentence of the last paragraph). Every other occurrence of the word "cucumber" violates at least one of the listed conditions.

<p>The tastiest cucumbers grew at my dacha last summer, it was a great season. When I picked them, I gobbled them up like this: <br /> <img src="cucumber-eater.png" alt="I am eating a cucumber"></p>
<p>By the way, did you know that <a href="http://example.com/super-facts/blue-cucumber" title="The link leads to a story about how to grow a blue cucumber">an ordinary cucumber can be blue</a>? I, for one, did not know, I thought they were all only green.</p>
<p>Personally, I find every cucumber, just now picked from the garden bed, appetizing. Although a fresh-looking cucumber can be bought in any supermarket even in winter, I prefer to eat only what has been grown with my own hands under vigilant supervision.</p>

The problem can be solved by working with the DOM: the inner text of each node is processed separately using simplehtmldom.sourceforge.net or another parser, and replacements are simply skipped for the inner text of <a> elements. But it would be much more convenient to have a solution in the form of a working regular expression (performance is not critical). I can't manage it myself, because I'm not familiar enough with regular expressions.
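For reference, the DOM route described above can be sketched with PHP's built-in DOMDocument and DOMXPath instead of simplehtmldom. This is only a minimal illustration under stated assumptions: the function name dom_replace_first and the wrapper id "wrap" are made up for the example, the sample markup is ASCII-only (UTF-8 input would need an explicit encoding hint for loadHTML), and the replacement is treated as plain text rather than markup.

```php
<?php
// Hypothetical helper, not from the thread: replace the first whole-word
// occurrence of $word found in a text node that is not inside an <a> element.
// $replacement is treated as plain text, not as markup.
function dom_replace_first(string $html, string $word, string $replacement): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    // Wrapping the fragment in a div gives one known container to read back.
    // Assumption: ASCII input; UTF-8 input needs an explicit encoding hint.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath   = new DOMXPath($doc);
    $pattern = '~\b' . preg_quote($word, '~') . '\b~u';

    // Text nodes only, so attribute values are never visited; anchors are excluded.
    foreach ($xpath->query('//div[@id="wrap"]//text()[not(ancestor::a)]') as $node) {
        if (preg_match($pattern, $node->nodeValue) === 1) {
            $node->nodeValue = preg_replace_callback(
                $pattern,
                function (array $m) use ($replacement): string { return $replacement; },
                $node->nodeValue,
                1 // only the first match inside this text node
            );
            break; // stop at the first qualifying text node
        }
    }

    // Serialize the children of the wrapper back to an HTML string.
    $out = '';
    foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = '<p><img src="cucumber.png" alt="a cucumber"></p>'
      . '<p><a href="/blue-cucumber" title="blue cucumber">a cucumber</a> and cucumbers.</p>'
      . '<p>Every cucumber is tasty, and a cucumber is cheap.</p>';

echo dom_replace_first($html, 'cucumber', 'gherkin'), "\n";
```

Because only text nodes are visited, attribute values such as alt and title never come into play, and the XPath predicate not(ancestor::a) filters out link text. The serializer may normalize the markup slightly (for example, whitespace between block elements), which is one reason a regex that works on the raw string can look more attractive.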

Thank you in advance for your attention to such a non-trivial case.

P.S. There is a lot of interesting material in the comments: a war of regexes and counterexamples. Thanks to Habr users Jaguar_ko, yui_room9 and dsd_corp for the ensuing battle :-)

4 answer(s)

Jaguar_ko, 2012-05-31
@kostin

/(search)(?!.*(?:)|(".*>))/
This is purely theoretical :)
I have no way to test it on a phone at one in the morning)
P.S.: search stands for the word being searched for :)
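A different and commonly used PCRE idiom for this kind of task, not the pattern above, is to match the regions that must be left alone and discard them with the backtracking verbs (*SKIP)(*FAIL), so that only text outside those regions can produce a real match. The sketch below is an illustration under assumptions: the function name is made up, anchors are assumed not to be nested, and attribute values are assumed not to contain a literal ">" character. On markup that breaks those assumptions it will give wrong results.

```php
<?php
// Hypothetical helper, not from the thread: replace the first whole-word
// occurrence of $word that lies outside <a>...</a> and outside any tag.
function replace_first_outside_anchors(string $html, string $word, string $replacement): string
{
    $pattern = '~(?i:<a\b[^>]*>.*?</a>)(*SKIP)(*FAIL)' // skip whole anchor elements
             . '|<[^>]*>(*SKIP)(*FAIL)'                 // skip any other tag, attributes included
             . '|\b' . preg_quote($word, '~') . '\b~su';

    // A callback keeps "$1" or "\1" in $replacement from being read as backreferences.
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($replacement): string { return $replacement; },
        $html,
        1 // limit: only the first match is replaced
    );

    return $result ?? $html; // preg_* returns null on error (e.g. invalid UTF-8)
}

$html = '<p><img src="cucumber.png" alt="a cucumber"></p>'
      . '<p><a href="/blue-cucumber" title="blue cucumber">a cucumber</a> and cucumbers.</p>'
      . '<p>Every cucumber is tasty, and a cucumber is cheap.</p>';

echo replace_first_outside_anchors($html, 'cucumber', '<b>cucumber</b>'), "\n";
```

With this sample, only the occurrence in "Every cucumber" is wrapped in <b>. The alt and title attributes, the link text, the plural "cucumbers" and the second free occurrence are left as they were.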

Ents, 2012-05-31
@Ents

This cannot be done with any regular language. If you wonder why, read about finite automata (regular expressions are a special case of them).
Look at the DOM instead.

Quiz, 2012-05-31
@Quiz

When a person encounters a problem, they think: "I can easily solve this with a regular expression!" From then on, they have two problems...

dsd_corp, 2012-05-31
@dsd_corp

Posted as requested above.
Go to this repository.
Download three files from there: xmlp.inc, progress.inc and cucumbers.zip.
The example for your question is in cucumbers.zip.
xmlp.inc is a DOM-style parser.
progress.inc is just a helper that the example uses to measure and display the running time.
Unzip the archive and copy the other two files into the resulting directory.
Then run example.php itself.
The main function you need is replace_text().
The first two parameters are self-explanatory: the text to search in and the string you are looking for.
The fourth parameter, $ignore_tags, is an array of tag names to ignore. In your case, per the task, that is 'A'. 'IMG' can be dropped from this array in the example; I just added it)
The third parameter is what to replace the found occurrences with.
But if the third parameter is false (that is how I implemented the option), the function returns not the modified string but an array of offsets of the occurrences found.
The function does not stop at the first valid occurrence: it replaces every occurrence that satisfies the conditions.
If you don't want the function to fix up HTML/XML flaws in its own way, and you also want to replace only selected occurrences, you can get the offsets and then either replace everything in a loop with PHP's substr_replace function (you have the offsets of the occurrences and you know the length of the search string), or replace only the first occurrence using the first offset from the returned array.
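As a rough sketch of that offsets-plus-substr_replace idea: the offsets array below is only a stand-in collected with preg_match_all for illustration, not the output of replace_text(), and the offsets are assumed to be byte offsets.

```php
<?php
$text = 'Every cucumber is tasty, and a cucumber is cheap.';
$word = 'cucumber';

// Stand-in for the offsets array the answer describes: byte offsets of each
// occurrence, collected here with preg_match_all purely for illustration.
preg_match_all('~' . preg_quote($word, '~') . '~', $text, $m, PREG_OFFSET_CAPTURE);
$offsets = array_column($m[0], 1); // [6, 31]

// Replace only the first occurrence, using the first offset.
$firstOnly = substr_replace($text, 'gherkin', $offsets[0], strlen($word));

// Replace every occurrence, working from the last offset back to the first,
// so a replacement of different length never shifts the offsets still to be used.
$all = $text;
foreach (array_reverse($offsets) as $offset) {
    $all = substr_replace($all, 'gherkin', $offset, strlen($word));
}

echo $firstOnly, "\n"; // Every gherkin is tasty, and a cucumber is cheap.
echo $all, "\n";       // Every gherkin is tasty, and a gherkin is cheap.
```

Going through the offsets in reverse matters whenever the replacement and the search string differ in length; replacing front to back would require adjusting every later offset by the accumulated length difference.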
In the example, the functions frt1(), frt2() and frt3() are functionally identical: frt1() works recursively, and in the other two I got rid of the recursion. frt3() differs from frt2() only in that the stack uses associative indexing (less dazzling and easier on the eyes). Since all three functions do the same thing, the first two can be removed.
The example actually uses frt3() for search and replace and frt4() for getting offsets.
The file cucumbers.txt is your example; in cucumbers1.txt I stuffed more cucumbers into various places)))
These files are used as input; you'll figure out the rest, everything is visible in the code.
The example's results are also written to files; you will see them in the same directory after the script has run.
If you have questions, ask.