Sergey Korolkov, 2019-01-16 21:04:59
.NET

How to parse text in CSV format ignoring commas inside quotes and without third-party libraries?

For example, there is a string: name, days, company.
Splitting a string is easy with the Split method:

string[] text = File.ReadAllLines("file.csv", Encoding.Default);
foreach (string line in text)
{
    string[] words = line.Split(',');
    foreach (string word in words)
    {
        Console.WriteLine(word);
    }
}
Console.ReadKey();

But how do I parse a line when everything enclosed in double quotes is text, even if it contains commas? For example:
"Sergey, Korolkov", 7 days, Ariel
Maxim, 3 days, "company, Oriflame"
Should output:
Sergey, Korolkov | 7 days | Ariel
Maxim | 3 days | company, Oriflame
Keep in mind that the input data will not always be well-formed as in the example. There may be three quotes in a row, or a line with no commas at all. The program must not crash in any case. If a line cannot be parsed, I will print a message saying so.

3 answer(s)
eRKa, 2019-01-16
@vhelsing90

It could be something like this:

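// Assumes `string source` holds one CSV line; needs `using System.Linq;`.
var parts = source.Split('"');
var data = parts.SelectMany((x, index) => index % 2 != 0
    ? new[] { x }        // odd index: text that was inside quotes, keep it whole
    : x.Split(','));     // even index: text outside quotes, split on commas
var result = string.Join(" | ", data.Where(x => !string.IsNullOrWhiteSpace(x)).Select(x => x.Trim()));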

But if there are three quotes, then it may not work.
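
One possible guard for the three-quotes case is to count the quote characters before splitting. This is only a sketch: `HasBalancedQuotes` is a hypothetical helper name, and it assumes that an odd number of quotes always means the line is malformed. An even count is necessary but not sufficient, so it does not prove the line is valid.

```csharp
using System;
using System.Linq;

public static class QuoteCheck
{
    // Hypothetical helper: true when the line contains an even number of '"' characters.
    // Escaped quotes ("") come in pairs, so a well-formed line always has an even count.
    public static bool HasBalancedQuotes(string line)
    {
        return line.Count(c => c == '"') % 2 == 0;
    }

    public static void Main()
    {
        string[] samples =
        {
            "\"Sergey, Korolkov\", 7 days, Ariel",
            "Maxim, 3 days, \"\"\"company, Oriflame",
        };
        foreach (string line in samples)
        {
            // Report the problem instead of trying to split an unbalanced line.
            Console.WriteLine(HasBalancedQuotes(line)
                ? "OK: " + line
                : "Cannot parse: " + line);
        }
    }
}
```

The first sample has two quotes and passes; the second has three and is reported as unparsable rather than being split incorrectly.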

Sumor, 2019-01-17
@Sumor

Based on mefutu's answer. A sample solution using a finite state machine:

using System;
using System.Text;
using System.Collections.Generic;
          
public class Program
{
  public static void Main()
  {
    Console.WriteLine(string.Join("|", Parse("Мама,\"мыла, блин\", раму,\"мама, мыла \"\"раму\"\"\",конец")));
  }
  
  public enum StateEnum{Start, StartQuot, Inline, InlineQuot}
  
  public static IEnumerable<string> Parse(string str)
  {
    var state = StateEnum.Start;
    var sb = new StringBuilder();
    foreach(var ch in str)
    {
      switch(ch)
      {
        case '"':
          switch(state)
          {
            case StateEnum.Start:
              state = StateEnum.StartQuot;
              continue;
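            case StateEnum.InlineQuot:
              // A quote right after a quote inside a quoted field is an escaped quote ("").
              state = StateEnum.Inline;
              sb.Append('"');
              continue;
            case StateEnum.StartQuot:
            case StateEnum.Inline:
              // A quote inside a quoted field either closes it or starts an escaped pair.
              state = StateEnum.InlineQuot;
              continue;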
          }
          break;
        case ',':
          switch(state)
          {
            case StateEnum.Start:
            case StateEnum.InlineQuot:
              yield return sb.ToString();
              sb.Clear();				
              state = StateEnum.Start;
              continue;
            case StateEnum.StartQuot:
            case StateEnum.Inline:
              sb.Append(',');
              state = StateEnum.Inline;
              continue;						
          }
          goto default;
        default:
          sb.Append(ch);
          break;
      }
    }
    yield return sb.ToString();
  }
}

mefutu, 2019-01-16
@mefutu

Well, Split won't help you here. Use or write a state machine; there is a simple example here: https://stackoverflow.com/questions/5923767/simple... . Go through each character, look at the state of the machine, and decide whether to append the character to the current field or whether it is time to store the finished field.
Approximate algorithm:
Quote opened:
- append everything to the current field, commas included
Quote closed:
- wait for a separator character (',' or ';')
P.S. A state machine may be overkill here, but read up on it for reference.
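
The approximate algorithm above could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the names `CsvLineSketch`, `State` and `TryParseLine` are hypothetical, only ',' is treated as a separator, a doubled quote inside quotes is read as one literal quote, and malformed input is reported through a return value instead of an exception.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLineSketch
{
    // Hypothetical state names for this sketch.
    private enum State { Unquoted, Quoted, AfterQuote }

    // Returns false (with a message in `error`) instead of throwing on malformed input.
    public static bool TryParseLine(string line, out List<string> fields, out string error)
    {
        fields = new List<string>();
        error = null;
        var current = new StringBuilder();
        var state = State.Unquoted;

        foreach (char ch in line)
        {
            switch (state)
            {
                case State.Unquoted:
                    if (ch == ',') { fields.Add(current.ToString().Trim()); current.Clear(); }
                    else if (ch == '"' && current.ToString().Trim().Length == 0)
                    {
                        current.Clear();              // drop spaces before the opening quote
                        state = State.Quoted;         // quote opened
                    }
                    else if (ch == '"') { error = "Quote inside an unquoted field"; return false; }
                    else current.Append(ch);
                    break;

                case State.Quoted:
                    if (ch == '"') state = State.AfterQuote;   // closing quote, or first half of ""
                    else current.Append(ch);                   // commas are plain text here
                    break;

                case State.AfterQuote:
                    if (ch == '"') { current.Append('"'); state = State.Quoted; }   // "" is an escaped quote
                    else if (ch == ',') { fields.Add(current.ToString()); current.Clear(); state = State.Unquoted; }
                    else if (!char.IsWhiteSpace(ch)) { error = "Text after a closing quote"; return false; }
                    break;
            }
        }

        if (state == State.Quoted) { error = "Quote was never closed"; return false; }
        fields.Add(state == State.AfterQuote ? current.ToString() : current.ToString().Trim());
        return true;
    }

    public static void Main()
    {
        string[] samples =
        {
            "\"Sergey, Korolkov\", 7 days, Ariel",
            "Maxim, 3 days, \"\"\"company, Oriflame",
        };
        foreach (string line in samples)
        {
            Console.WriteLine(TryParseLine(line, out var fields, out var error)
                ? string.Join(" | ", fields)
                : "Cannot parse: " + error);
        }
    }
}
```

With these samples the first line prints as `Sergey, Korolkov | 7 days | Ariel`, and the second, which has three quotes in a row, is reported as unparsable because its quote is never closed. A line without commas comes back as a single field.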
