Python For Data Science

Data scientist has been called “the sexiest job of the 21st century,” presumably by
someone who has never visited a fire station. Nonetheless, data science is a hot and growing field, and it doesn’t take a great deal of sleuthing to find analysts breathlessly prognosticating that over the next 10 years, we’ll need billions and billions more data scientists than we currently have.
But what is data science? After all, we can’t produce data scientists if we don’t know
what data science is. According to a Venn diagram that is somewhat famous in the
industry, data science lies at the intersection of:
• Hacking skills
• Math and statistics knowledge
• Substantive expertise

Python has several features that make it well suited for learning (and doing) data science:
• It’s free.
• It’s relatively simple to code in (and, in particular, to understand).
• It has lots of useful data science–related libraries.

The Basics
Getting Python
You can download Python from python.org. But if you don’t already have Python, I
recommend instead installing the Anaconda distribution, which already includes
most of the libraries that you need to do data science.

As I write this, the latest version of Python is 3.4. At DataSciencester, however, we use
old, reliable Python 2.7. Python 3 is not backward-compatible with Python 2, and
many important libraries only work well with 2.7. The data science community is still
firmly stuck on 2.7, which means we will be, too. Make sure to get that version.
If you don’t get Anaconda, make sure to install pip, which is a Python package manager
that allows you to easily install third-party packages (some of which we’ll need).
It’s also worth getting IPython, which is a much nicer Python shell to work with.
(If you installed Anaconda, it should already include pip and IPython.)
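
Once everything is installed, a quick sanity check can confirm which interpreter you are running and whether pip and IPython are visible to it. The following is only a minimal sketch, not part of any official setup procedure: the helper names python_version_string and is_installed are made up for illustration, and the script assumes that successfully importing a module is a good enough test that it is installed. It sticks to constructs that run under both Python 2.7 and Python 3.

```python
import sys


def python_version_string():
    """Return the running interpreter's version, e.g. '2.7.18'."""
    return ".".join(str(part) for part in sys.version_info[:3])


def is_installed(module_name):
    """Return True if the named module can be imported (hypothetical helper)."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


print("Python version: {0}".format(python_version_string()))

# pip and IPython are the two tools mentioned above; both are importable
# under these exact module names when installed.
for name in ("pip", "IPython"):
    status = "installed" if is_installed(name) else "missing"
    print("{0}: {1}".format(name, status))
```

Save it to a file and run it with the interpreter you intend to use. If either tool is reported as missing, that interpreter cannot see it, which usually means it was installed for a different Python or not installed at all.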
