How to get rid of unicode characters in a string?

DYLAN, 2019-02-16 13:09:16

PHP

Hello everyone, I ran into the following problem:
There is an XML file with a lot of articles in it. It is all processed without trouble, but as soon as a string containing Unicode characters such as
\x0F or \x0E gets into the XML, SimpleXML throws an exception saying the file cannot be read.
Example:
<Value> (Fig. 1) </Value>
https://www.freeformatter.com/xml-formatter.html - here you can even test it.
I tried to fight it with something like this:
$content = str_replace("\x0F", "", $content);
Alternatively, I could run str_replace in a loop over all the "bad" characters, but the text is large and str_replace for some reason does not replace them everywhere. Another option is to write my own character replacer, but maybe someone has already come across this? The problematic fragment is written here

1 answer(s)

Vapaamies, 2019-02-16
@vapaamies

Unicode has nothing to do with it at all. The snippet you provided contains characters with decimal code 15 (U+000F). They are encoded identically in all ASCII/ISO-compatible encodings (but not in EBCDIC), so changing the encoding won't help.
The presence of such characters may indicate text imported from some old (DOS) program that used the control codes of an Epson-compatible printer, or a sloppy (automated) import from a binary format, during which control characters leaked in along with the text.
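One possible reason a single str_replace misses some spots is that the text holds other control characters besides \x0F. Below is a minimal sketch for checking that, assuming the raw XML is already loaded into a string. The helper name findControlBytes and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: list the control bytes that XML 1.0 does not allow,
// together with their byte offsets in the text.
function findControlBytes(string $text): array
{
    $found = [];
    // Tab (0x09), LF (0x0A) and CR (0x0D) are legal in XML 1.0, so they are skipped.
    // No /u flag: the pattern works on raw bytes, which is safe for UTF-8 because
    // bytes below 0x20 never occur inside a multibyte sequence.
    if (preg_match_all('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $text, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as [$byte, $offset]) {
            $found[] = sprintf('0x%02X at byte offset %d', ord($byte), $offset);
        }
    }
    return $found;
}

// Illustrative sample only, not the original problem fragment.
$sample = "<Value>\x0F(Fig. 1)\x0E</Value>";
print_r(findControlBytes($sample));
```

If the list shows more codes than the one being replaced, that would explain why some spots were left untouched.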
In general, web documents should not contain characters with codes below the space character, 0x20 (line breaks and tabs do not count). You need to decide whether to replace such characters with spaces or simply delete them, and then process the incoming text, fixing all problematic characters indiscriminately:

$text = preg_replace('/[\x01-\x08\x0B\x0C\x0E-\x1F]/', ' ', $text);  // replace with a space
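To show where this cleanup could sit relative to the SimpleXML call, here is a rough sketch rather than a definitive implementation. The function name stripXmlControlChars, the $replacement parameter and the sample XML are assumptions for illustration. The character class is also widened to include NUL (0x00), which XML 1.0 forbids as well.

```php
<?php
// Hypothetical helper: replace control characters that XML 1.0 forbids.
// Pass '' as $replacement to delete them instead of substituting a space.
function stripXmlControlChars(string $xml, string $replacement = ' '): string
{
    // preg_replace returns null on failure; fall back to the unchanged input then.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $replacement, $xml) ?? $xml;
}

// Illustrative input only, with a stray 0x0F byte inside an element.
$content = "<Articles><Value>\x0F(Fig. 1)</Value></Articles>";

// Collect libxml parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$xml = simplexml_load_string(stripXmlControlChars($content));

if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), PHP_EOL;
    }
} else {
    echo (string) $xml->Value, PHP_EOL;
}
```

Cleaning the raw string before it reaches the parser means one pass covers the whole file, instead of a separate str_replace call per character.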