Linux - How to write a string in a different encoding to a string object in C++?

linux

Gena, 2014-03-23 21:50:22

Hello. The gist of the problem is that two strings that look identical do not compare as equal.

The first string (say, string s1 = "name") is received from the FileZilla program over a socket; the second is hardcoded (string s2 = "name"). The two strings look exactly the same when printed to the console like this:

printf("s1 = \'%s\', has size %u, and s2 = \'%s\' has size %u\n", s1.c_str(), (unsigned int)s1.size(), s2.c_str(), (unsigned int)s2.size());

I get the following: s1 = 'name', has size 8, and s2 = 'name' has size 4

Both strings print to the console normally, but their sizes clearly differ.
I compared the strings as follows:

if(s1 == s2) {
    doSomething();
}

and

if(strcmp(s1.c_str(), s2.c_str()) == 0) {
    doSomething();
}

Has anyone experienced this? What can be done?

OS: kubuntu 13.10

Thanks.

3 answer(s)

EXL, 2014-03-24
@EXL

Try converting both strings to the same encoding and then compare them. You can use the libiconv library for this:
main.cpp:

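#include <fstream>
#include <iostream>
#include <string>

#include <iconv.h>

using namespace std;

// Converts text from encoding `from` to encoding `to`; returns "" on failure.
string iconv_recode(const string &from, const string &to, const string &text)
{
    iconv_t cnv = iconv_open(to.c_str(), from.c_str());
    if (cnv == (iconv_t) -1) {
        return "";  // nothing to close: the descriptor is invalid
    }

    // Assumption: 4 output bytes per input byte is enough for common encodings.
    // A more robust version would grow the buffer when iconv() fails with E2BIG.
    string outbuf(text.length() * 4 + 4, '\0');

    // iconv() takes char ** for the input, but it does not modify the input bytes.
    char *ip = const_cast<char *>(text.data());
    char *op = &outbuf[0];
    size_t icount = text.length(), ocount = outbuf.size();

    string result;
    if (iconv(cnv, &ip, &icount, &op, &ocount) != (size_t) -1) {
        // Copy by explicit length so that embedded zero bytes are preserved.
        result.assign(outbuf.data(), outbuf.size() - ocount);
    }

    iconv_close(cnv);
    return result;
}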

void compare_strings(const string &aString1, const string &aString2) {

    cout << "String 1: " << aString1 << endl
         << "String 2: " << aString2 << endl;

    if (aString1 == aString2) {
        cout << "Identical strings!" << endl
             << "-----" << endl;
    } else {
        cout << "Different strings!" << endl
             << "-----" << endl;
    }
}

int main()
{
    ifstream file_1("word_1.txt");  // The "Proverka" Word in UTF-8
    ifstream file_2("word_2.txt");  // The "Proverka" Word in CP1251
    string word_1, word_2;

    file_1 >> word_1;
    file_2 >> word_2;

    compare_strings(word_1, word_2);

    word_2 = iconv_recode("CP1251", "UTF-8", word_2);

    compare_strings(word_1, word_2);

    return 0;
}

exl@exl-Lenovo-G560e:~/SandBox/text_enc > enca -L russian  word_1.txt 
Universal transformation format 8 bits; UTF-8
  Doubly-encoded to UTF-8 from ISO-8859-5
exl@exl-Lenovo-G560e:~/SandBox/text_enc > enca -L russian  word_2.txt 
MS-Windows code page 1251
  LF line terminators
exl@exl-Lenovo-G560e:~/SandBox/text_enc > cat word_1.txt 
Проверка 
exl@exl-Lenovo-G560e:~/SandBox/text_enc > cat word_2.txt 
��������
exl@exl-Lenovo-G560e:~/SandBox/text_enc > ./text_coding 
String 1: Проверка
String 2: ��������
Different strings!
-----
String 1: Проверка
String 2: Проверка
Identical strings!
-----

s0L, 2014-03-24
@s0L

I don't think it's the encoding; otherwise you wouldn't see the same "name" printed in both cases. Most likely the string contains something extra, for example because of incorrect code for receiving data from the socket.
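
One way to check what else is in the string is to print its bytes in hex rather than as text. The sketch below is only an illustration: hex_dump is a hypothetical helper name, not something from the question's code.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: renders every byte of s as two hex digits separated
// by spaces, so invisible bytes (CR, LF, NUL, ...) become visible.
std::string hex_dump(const std::string &s)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For example, printf("%s\n", hex_dump(s1).c_str()); prints "6e 61 6d 65" for a plain "name". Any extra bytes in s1 (a trailing 0d 0a, zero bytes, and so on) would show up in that output.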

bogolt, 2014-03-24
@bogolt

I ran into a similar problem. Most likely the data is being written into the UTF-8 string incorrectly. Instead of "name" it contains "0n0a0m0e" (where 0 is a zero byte), that is, each character takes 2 bytes instead of one.
As a solution, take any library that handles UTF-8/UTF-16 and make sure both strings are in the same encoding. As the simplest option, if my guess about the zero bytes is confirmed, you can simply strip them out by hand (provided, of course, that everything is ASCII).
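
A minimal sketch of the "strip them out by hand" option, assuming the text really is pure ASCII widened to 2 bytes per character. The name strip_nul_bytes is hypothetical. For non-ASCII text this would corrupt the data, and a proper encoding conversion is needed instead.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: returns a copy of s with every zero byte removed.
// Only safe when the text is pure ASCII stored as 2 bytes per character.
std::string strip_nul_bytes(std::string s)
{
    // Erase-remove idiom: std::remove shifts the kept bytes to the front,
    // erase() then drops the leftover tail.
    s.erase(std::remove(s.begin(), s.end(), '\0'), s.end());
    return s;
}
```

After s1 = strip_nul_bytes(s1); the comparison s1 == s2 should succeed if zero bytes were the only difference.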