Processing Text in PHP with Regular Expressions

Yuri Pikhtarev, 2011-04-07 20:28:48

PHP

Good day.

I have run into a problem that I cannot solve on my own. I need to process user-entered text in PHP before passing it to the Sphinx search engine. The search should match all of the entered words rather than the exact phrase, so the + operator has to be added before each individual word.

For this purpose I use the following function:

function clean_text_match ($text, $all_words)
{
  global $db, $bb_cfg;

  $text = ' '. mb_strtolower($text, 'UTF-8') .' ';

  if ($all_words)
  {
    $text = preg_replace('#\s(\b\w)#', ' +$1', $text);
  }
  $text_match_sql = $db->escape(trim($text));

  return $text_match_sql;
}

The problem is that the regular expression processes the incoming text incorrectly. For example, suppose a user searches for the movie "Sex and the City". When searching for the Russian phrase "секс в большом городе", the echoed output of the function shows that the text came out exactly as it went in:

секс в большом городе

Entering the corresponding phrase in English gives the intended result:

+sex +and +the +city

As you can see, the Russian text is not matched by the regular expression, for a reason I do not understand, while English works fine. All input arrives in the required encoding (UTF-8), so in principle there should be no problem with the text itself. The problem must therefore be in the regular expression.

Let's simplify it a little to the following structure:

...
  if ($all_words)
  {
    $text = preg_replace('#\s#', ' +$1', $text);
  }
...

We run the Russian text:

+секс +в +большом +городе +

This looks almost right (except that the trailing space also received a +). However, if I want to use other Sphinx operators, such as NOT (! or -), then running text with a negation (here, negating the word "городе") gives the following:

+секс +в +большом +-городе +

This is incorrect; with a negation, the result should ideally be:

+секс +в +большом -городе

I have searched without success for a regular expression that prefixes words with a given character while skipping those already preceded by a non-whitespace character (@, !, -, etc.). So I am asking for help here: is there a way to achieve this with a different regular expression?

While searching, I came across this comment: ru2.php.net/manual/ru/regexp.reference.escape.php#102868 - apparently, the \b escape sequence does not work well with Unicode.

Thank you.

5 answer(s)

ertaquo, 2011-04-07
@Exileum

Try the u modifier:
$text = preg_replace('#\s#u', ' +$1', $text);
www.php.net/manual/en/reference.pcre.pattern.modifiers.php
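
To illustrate the suggestion, here is a minimal sketch that runs the pattern from the question with and without the u modifier. It assumes the source file is saved as UTF-8 and that PHP's PCRE was built with Unicode support; the variable names are only for illustration.

```php
<?php
// Illustrative input: the question's phrase, padded with spaces as in clean_text_match().
$text = ' секс в большом городе ';

// Without "u" the subject is treated as a sequence of bytes, so \w is not
// expected to match the Cyrillic letters (exact behaviour can depend on the locale).
$withoutU = trim(preg_replace('#\s(\b\w)#', ' +$1', $text));

// With "u" pattern and subject are treated as UTF-8, and in current PHP builds
// \w and \b use Unicode properties, so every word gets its "+" prefix.
$withU = trim(preg_replace('#\s(\b\w)#u', ' +$1', $text));

echo $withoutU, "\n";
echo $withU, "\n";
```

Note that the capturing group (\b\w) is kept here, so that $1 in the replacement actually refers to something.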

[email protected]><e, 2011-04-07
@barmaley_exe

1. To use Unicode, you need the appropriate modifier.
2. The characters matched by \w depend on the locale. Check the locale or, better yet, replace \w with [a-zа-яё].

Anatoly, 2011-04-07
@taliban

public function clean_text_match($text, $all_words)
{
	//global $db, $bb_cfg;
	
	$text = ' '. mb_strtolower($text, 'UTF-8') .' ';
	
	if ($all_words)
	{
		$text = preg_replace('#\s(\b\w)#', ' +$1', $text);
	}
	//$text_match_sql = $db->escape(trim($text));
	
	return $text;
}
	
public function aaaAction()
{
	echo $this->clean_text_match( 'секс в большом городе', true );
}

result: +секс +в +большом +городе
Are you sure you're getting the correct string?

Begetan, 2011-04-07
@Begetan

I'm quoting this; it's not mine, but I found it and am interested in the answer too.
PCRE has special sequences for different classes of Unicode characters, such as "\p{L}" for letters, "\p{N}" for numbers, and so on.
…
First, try: $text = preg_replace('#\s(\b\pLN)#', ' +$1', $text);
Then there are other methods:
bolknote.ru/2010/09/08/ ~2704#29

Begetan, 2011-04-07
@Begetan

More precisely, just \pL
And play around
www.php.net/manual/en/regexp.reference.unicode.php
"...thousand characters. Therefore, traditional PCRE escape sequences such as \d and \w do not use Unicode properties."
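
Putting the \pL idea together with the u modifier, the behaviour asked for in the question could be sketched roughly as follows. This is only an illustration under stated assumptions: add_plus_operators is a hypothetical name, the $db->escape() step from the original function is left out, and the mbstring extension is assumed to be available. Instead of capturing the first letter, the pattern matches the empty position between whitespace and a letter or digit, so words that already start with an operator are left alone.

```php
<?php
// Hypothetical helper, not the asker's original function.
function add_plus_operators(string $text): string
{
    // Pad with spaces so the first word is also preceded by whitespace.
    $text = ' ' . mb_strtolower($text, 'UTF-8') . ' ';

    // Match the zero-width position that follows whitespace and precedes
    // a Unicode letter or digit, and insert "+" there. A word that starts
    // with -, ! or @ does not match, because its first character is not
    // a letter or digit and its first letter does not follow whitespace.
    $result = preg_replace('#(?<=\s)(?=[\p{L}\p{N}])#u', '+', $text);

    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // in that case fall back to the unmodified text.
    return trim($result ?? $text);
}

echo add_plus_operators('Секс в большом -городе'), "\n";
echo add_plus_operators('sex and the !city'), "\n";
```

Because the trailing space is not followed by a letter, it no longer receives a stray + at the end.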