Python
lPolar, 2015-03-24 23:16:03

Is Python as good as R for data mining?

Hello!
I have been using Python for a year. My main tasks are data collection (read: parsing), analysis, visualization and modeling.
At first the language suited me in every respect, until I ran into out-of-core tasks (where the data is strictly larger than RAM).
I know about reading data in chunks and about partial_fit in sklearn, but this approach noticeably slows down model building and reduces model quality.
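
A minimal sketch of the chunked-reading plus partial_fit approach mentioned above, assuming scikit-learn's SGDClassifier as the incremental model. The synthetic data, the column names (f0, f1, f2, label) and the chunk size are made up for illustration, and an in-memory CSV stands in for a file that would not fit in RAM.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Hypothetical data: a small in-memory CSV stands in for a file larger than RAM.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
frame = pd.DataFrame(X, columns=["f0", "f1", "f2"])
frame["label"] = y
csv_source = io.StringIO(frame.to_csv(index=False))

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit must be told every class label on the first call

chunks_seen = 0
# With chunksize set, read_csv yields DataFrames of at most 100 rows each,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(csv_source, chunksize=100):
    features = chunk[["f0", "f1", "f2"]].to_numpy()
    labels = chunk["label"].to_numpy()
    model.partial_fit(features, labels, classes=classes)
    chunks_seen += 1

# Scored on the training data only, to confirm the model learned something.
accuracy = model.score(X, y)
```

Each partial_fit call makes a single pass over its chunk, which is one reason a model trained this way can end up weaker than one fitted on the full dataset at once.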
Plus, there are some totally annoying language problems:
1. Unicode. Yes, it got better with py3, but not everywhere. Typical example:

import pandas as pd
rdf = pd.DataFrame(['привет', 'юникод'])
rdf.to_clipboard()  # say I want to paste the table into Excel

2. Lack of a proper package repository.
Of course, tools like pip, easy_install and conda make life easier, but required packages often have to be compiled manually (cx_Oracle, for example).
3. Absence of many data mining methods.
For example, deep learning for Python is hard to find, and harder still to use.
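
For the clipboard case in point 1, one possible workaround (not something the post itself proposes) is to go through a CSV file written with a UTF-8 byte-order mark, which Excel generally uses to detect the encoding. The column name and file name below are hypothetical.

```python
import os
import tempfile

import pandas as pd

rdf = pd.DataFrame({"word": ["привет", "юникод"]})

# Hypothetical output location; a temporary directory keeps the sketch self-contained.
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "words.csv")

# "utf-8-sig" prepends a byte-order mark to the UTF-8 output.
rdf.to_csv(csv_path, index=False, encoding="utf-8-sig")

with open(csv_path, "rb") as fh:
    raw = fh.read()

has_bom = raw.startswith(b"\xef\xbb\xbf")

# Reading back with the same codec strips the mark again.
round_trip = pd.read_csv(csv_path, encoding="utf-8-sig")
```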
Now for the R side.
What I liked:
1. The package repository and the package installation system in general.
2. There are packages for literally everything. I was especially pleased with the out-of-core processing packages, such as ff and bit.
3. There are multiple ways to solve the same problem (which is a good thing in data mining).
What I did not like:
1. The syntax for some tasks is not entirely obvious: magic operators such as %in%, and so on.
2. Some obvious things are poorly implemented, such as import/export to Excel.
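
For readers who know Python better than R: %in% is R's vectorized membership operator, returning one logical value per element of its left-hand side. A rough pandas counterpart is Series.isin, sketched below with made-up values.

```python
import pandas as pd

values = pd.Series(["a", "b", "c", "d"])
wanted = ["b", "d", "z"]

# Element-wise membership test: one boolean per element of `values`,
# comparable to `values %in% wanted` in R.
mask = values.isin(wanted)

# Plain Python `in` tests a single element instead.
single = "b" in wanted

# The boolean mask can be used directly to filter the Series.
selected = values[mask].tolist()
```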
So my question is for those who have experience with both languages: which would you focus on? Should I keep learning Python and pick up the APIs for pyspark and graphlab?
Which of the two languages will be more in demand in data science in the foreseeable future?

3 answers
polyhedron, 2015-03-25
@lPolar

I use both languages and, frankly, I like R better. You are right that it has packages for absolutely everything. But Python has a number of advantages, chief among them the language's mature ecosystem. The advantages of Python are described very well here. In general, that blog has a lot of interesting articles on both Python and R. As for deep learning, there is the wonderful Theano library for Python.
I would recommend focusing on Python, but also keep R in mind in case you need methods that aren't implemented in Python, or work with people who only know R.

globuser, 2015-03-25
@globuzer

Any specialized tool (language), or any combination of them, is only as good as the specialist using it, and of course only as good as that specialist's theoretical background in data mining, mathematics and statistics.
Sometimes, with enough contortion, even the most complex algorithmic or statistical problem involving data analysis can be solved in a language not intended for it at all.
As far as Python and R are concerned, both are good. Seriously, both. The only catch with Python is that you have to install additional libraries and modules. BUT! Every analytical, algorithmic or statistical task has its own specifics and its own ties to a particular technology or theory, and that is what should decide which language to use: Python, R, or both together. You can also bring in STATISTICA or something else; even ordinary MATLAB or Excel can be of invaluable help. A task, especially a complex one, should always be approached comprehensively! Then success is guaranteed with a probability close to one!

Andrey Andrianov, 2015-04-02
@aTwice

The R syntax made my eyes water.
