Is there a correct way to process Cyrillic texts using awk?

linux

zradeg, 2019-09-24 19:38:14

There are two files, country.csv and president.csv.
country.csv has two columns: 1) country name; 2) population.
president.csv also has two columns: 1) country name; 2) the name of its president.
A semicolon is used as the separator.
I need to produce a third file (or add a column to the first one, it doesn't matter which) in which all three fields are on one line: country name; population; president's name.
The two files have different numbers of lines, and some countries may be missing from one of them, so simply sorting both files and blindly pasting the columns together will not work. For each line of the first file, I need to take the value of its first field, find the line in the second file with the same value, and take the second column of that line.
I'm trying to do it with a script like this:

#!/bin/bash
                                                     
while read LINE; do
        C_NAME=$(echo $LINE | cut -d";" -f1)
        awk -v country=$C_NAME -v line=$LINE -F";" '$1 == country {print line";"$2}' president.csv >>result.csv
done < country.csv

And I get an error message:

awk: cmd. line:1: Albania
awk: cmd. line:1: ^ invalid char '�' in expression

How can I fix this?
P.S. I forgot to mention that both files are already in UTF-8!

4 answer(s)

DevMan, 2019-09-24
@zradeg

Your code is correct; the problem is most likely in the data.
If you upload the CSV files somewhere, we can take a closer look.

Andrey Dugin, 2019-09-24
@adugin

iconv -f cp1251 -t utf8 president.csv | awk ...
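A minimal sketch of how one might check the encoding before converting, since running iconv on a file that is already UTF-8 would corrupt it. It assumes that any file that is not valid UTF-8 is cp1251, which is only a guess; the sample file and its name are made up for illustration.

```shell
#!/bin/bash
# Hypothetical sample: the line "Албания;100" stored as cp1251 bytes.
printf '\300\353\341\340\355\350\377;100\n' > president.cp1251.csv

# iconv exits non-zero when the input is not valid in the source encoding,
# so a UTF-8 to UTF-8 pass can serve as a validity check.
if iconv -f UTF-8 -t UTF-8 president.cp1251.csv > /dev/null 2>&1; then
    # Already valid UTF-8: keep the file as it is.
    cp president.cp1251.csv president.utf8.csv
else
    # Assumption: the legacy encoding is cp1251.
    iconv -f CP1251 -t UTF-8 president.cp1251.csv > president.utf8.csv
fi
```

The converted copy can then be passed to awk in place of the original file.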

vreitech, 2019-09-24
@fzfx

Make sure your CSV files do not start with a BOM.
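One way to check for a BOM and strip it, as a rough sketch: a UTF-8 BOM is the three bytes EF BB BF at the very start of the file. The sample file and the output name country.nobom.csv are invented for the example.

```shell
#!/bin/bash
# Hypothetical sample file that starts with a UTF-8 BOM (bytes EF BB BF).
printf '\357\273\277Албания;100\n' > country.csv

# Dump the first three bytes as hex with no spaces or newline.
first3=$(head -c 3 country.csv | od -An -tx1 | tr -d ' \n')

if [ "$first3" = "efbbbf" ]; then
    # BOM found: copy everything from the 4th byte onwards.
    tail -c +4 country.csv > country.nobom.csv
else
    # No BOM: keep the file unchanged.
    cp country.csv country.nobom.csv
fi
```

Working on bytes with head and tail avoids depending on how a particular awk or sed treats non-ASCII bytes in a regular expression.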

zradeg, 2019-09-25
@zradeg

The cause was my own carelessness and... the wrong line terminator: \r\n instead of \n!
Sorry, and thank you to everyone who took an interest in the problem!
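For anyone who lands here with the same symptoms, here is a minimal sketch of the join that tolerates \r\n line endings. It is not the asker's final script (that was not posted), and the sample rows and names are invented. It does the whole join in one awk call, so no shell variables are passed with -v. In the original loop, the unquoted $C_NAME and $LINE may also have contributed: a country name containing a space gets split into several arguments, and awk then takes one of the pieces as its program text.

```shell
#!/bin/bash
# Hypothetical sample data with Windows (CRLF) line endings.
printf 'Албания;100\r\nСеверная Македония;200\r\nАндорра;300\r\n' > country.csv
printf 'Северная Македония;Иванов\r\nАлбания;Петров\r\n' > president.csv

awk -F';' '
    # Drop a trailing carriage return, if any; awk re-splits the fields afterwards.
    { sub(/\r$/, "") }

    # First file (president.csv): remember the president for each country.
    NR == FNR { president[$1] = $2; next }

    # Second file (country.csv): print the line plus the president, if one is known.
    $1 in president { print $0 ";" president[$1] }
' president.csv country.csv > result.csv
```

With the sample data, result.csv gets the Албания and Северная Македония rows with the president appended and no stray \r. Андорра is skipped because it has no entry in president.csv, which matches what the original loop does.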