Python
Ilya Rodionov, 2017-11-23 19:10:16

How to sort different phone numbers?

Hello!
I have a question: there is a .txt file with more than 10,000 lines.
The file contains data in the form:

First Name Last Name : 899999999
(the part after the colon is the phone number)
That is, each line consists of a first name, a last name and a phone number.
The catch is that the number does not always look like 8999...
Sometimes it is written as +7...
sometimes as 7...
sometimes as 99999999 (that is, without the 8/7/+7 prefix),
sometimes it is even written like this (!): 8-nine-123-zero-123,
and sometimes, instead of a phone number, the line says "Ivanov Ivan: no phone (there is a phone)" or some other text.
So the question is: how do I teach Python (?) to tell what is a phone number and what is not?
I understand how to implement this for a single number format, 8999... or 7999...
But when a) the formats differ, b) the numbers are not always purely digits (9-zero-12), and c) the entries are not always phone numbers at all, I get confused.
I had the idea of training Python with neural networks (?), but I don't understand that area at all.
And as far as I can tell, there are no libraries that work with phone numbers.
Thank you in advance for your answer!

3 answers
Egor Stakhovsky, 2017-11-23
@ySky

I can't suggest a library for working with numerals in Russian (or in any other language), but have you thought about making the parser simpler?
Something like:

import re


REPLACEMENT = {
  'ноль': '0',
  'один': '1',
  'два': '2',
  'три': '3',
  'четыре': '4',
  'пять': '5',
  'шесть': '6',
  'семь': '7',
  'восемь': '8',
  'девять': '9'
}


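# Optional "+", then 10 or 11 digits (raw string, so "\+" and "\d" are not treated as string escapes).
PHONE_REGEX = re.compile(r'\+?\d{10,11}')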


def parse_phones(file_path):
  parsed = []
  unparsed = []
  with open(file_path, 'r') as file:
    for line in file:
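      # partition() does not raise when a line has no colon.
      name, _, phone = line.partition(':')
      name = name.strip()
      phone = phone.strip().lower()
      for key, value in REPLACEMENT.items():
        phone = phone.replace(key, value)
      # Drop separators such as spaces, dashes and parentheses.
      phone = re.sub(r'[\s\-()]', '', phone)
      # fullmatch() rejects strings with extra characters after the digits.
      if PHONE_REGEX.fullmatch(phone):
        # Count digits only, so a leading "+" does not skew the length.
        digits = phone.lstrip('+')
        if len(digits) == 10:
          phone = '+7' + digits
        else:
          # 11 digits: swap the leading 8 or 7 for +7.
          phone = '+7' + digits[1:]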
        parsed.append((name, phone))
      else:
        unparsed.append(line)
  return parsed, unparsed

Instead of collecting the results in lists, you can write them straight to files. At the very least, this approach will reduce the number of "unknown" numbers.
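A minimal sketch of the write-to-files variant, in case it helps. The names `normalize` and `split_to_files` and the output paths are hypothetical, the files are assumed to be UTF-8, and any 10- or 11-digit number is assumed to be a Russian one. Here `normalize` only handles digits and separators; the word replacement from the code above would run before it.

```python
import re

PHONE_REGEX = re.compile(r'\+?\d{10,11}')


def normalize(raw):
    """Return '+7XXXXXXXXXX' for a 10/11-digit number, else None (hypothetical helper)."""
    # Drop whitespace, dashes and parentheses before matching.
    phone = re.sub(r'[\s\-()]', '', raw)
    if not PHONE_REGEX.fullmatch(phone):
        return None
    digits = phone.lstrip('+')
    # Keep the last 10 digits; a leading 8 or 7 (if any) is replaced by +7.
    return '+7' + digits[-10:]


def split_to_files(src_path, parsed_path, unparsed_path):
    """Read src_path once, writing recognised and unrecognised lines to separate files."""
    with open(src_path, encoding='utf-8') as src, \
         open(parsed_path, 'w', encoding='utf-8') as good, \
         open(unparsed_path, 'w', encoding='utf-8') as bad:
        for line in src:
            name, _, raw = line.partition(':')
            phone = normalize(raw)
            if phone is None:
                bad.write(line)
            else:
                good.write(f'{name.strip()}: {phone}\n')
```

Nothing is held in memory beyond the current line, and the "unparsed" file can then be reviewed by hand.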

Dimonchik, 2017-11-23
@dimonchik2013

num = ''.join([x for x in num if x.isdigit()])
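For example, applied to a couple of the formats from the question (just a sketch; the helper name `digits_only` and the sample strings are made up for illustration):

```python
def digits_only(num):
    """Keep only the digit characters of num, dropping '+', spaces, dashes and letters."""
    return ''.join(x for x in num if x.isdigit())


for raw in ['+7 (999) 123-45-67', '8-999-123-45-67', 'no phone']:
    print(repr(raw), '->', repr(digits_only(raw)))
```

Text without digits collapses to an empty string, so a length check afterwards (10 or 11 digits, say) is still needed to tell phones from other text. Digits written as words are dropped as well, so those have to be converted first.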

ivodopyanov, 2017-11-24
@ivodopyanov

1. Filter out everything that is definitely not a phone (texts "no phone", etc.)
2. Turn digits written as words into digits.
3. Leave only digits and "+" in the text.
If the dataset guarantees that digits appear only in phone numbers (no IP addresses, postal codes, passport data, etc.), then this should work.
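The three steps above could be sketched roughly like this. The function name `extract_phone` is hypothetical, and the digit words are assumed to be in Russian, as in the dictionary from the first answer. Steps 1 and 2 are swapped in the sketch so that the "definitely not a phone" filter can be a simple check for leftover letters.

```python
import re

# Assumption: digit words are written in Russian in the source file.
WORDS = {
    'ноль': '0', 'один': '1', 'два': '2', 'три': '3', 'четыре': '4',
    'пять': '5', 'шесть': '6', 'семь': '7', 'восемь': '8', 'девять': '9',
}


def extract_phone(text):
    """Return the cleaned phone string, or None when the text does not look like a phone."""
    text = text.lower()
    # Step 2: turn digit words into digits.
    for word, digit in WORDS.items():
        text = text.replace(word, digit)
    # Step 1: reject text that still contains letters ("no phone", etc.).
    if re.search(r'[^\W\d_]', text):
        return None
    # Step 3: keep only digits and "+".
    cleaned = re.sub(r'[^\d+]', '', text)
    return cleaned or None
```

A length check on the result (10 or 11 digits) would still be a sensible last step before treating it as a phone number.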
