Getting started with data science and machine learning in Python

Useful libraries for data visualization

There is a great overview and comparison of these libraries at pbpython.com.

There are so many libraries and frameworks used in Python for data analysis that I had to take a step back and illustrate how they were laid out.

  • PyDataSet
    • Provides instant access to many datasets right from Python (in pandas DataFrame structure).
  • NumPy
    • The fundamental package for scientific computing with Python. A fairly low-level tool.
    • Pandas
      • Built on NumPy, and adds much more. Provides rich time series functionality, data alignment, NA-friendly statistics, groupby, merge and join methods, and lots of other conveniences.
    • SciPy
      • Collection of mathematical algorithms and convenience functions built on the NumPy extension of Python. It adds significant power to the interactive Python session by providing the user with high-level commands and classes for manipulating and visualizing data.
        • Scikit-learn
          • I ALWAYS USE THIS
          • Module for machine learning built on top of SciPy
          • Simple and efficient tools for data mining and data analysis
          • Built on NumPy, SciPy, and matplotlib
  • Matplotlib
    • You can generate plots, histograms, power spectra, bar charts, error charts, scatterplots, etc., with just a few lines of code.
    • Seaborn
      • Visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
  • Plot.ly
    • Turns your data into eye-catching and informative graphics using Plotly's open source visualization library and its online chart creation tool. Will not work in Azure Notebooks. Charts can be hosted in Plotly's cloud or rendered offline, but either way it requires a large amount of data IO, which exceeds what the notebook can handle.
  • Cufflinks
    • 3D data visualization: the power of plotly with the flexibility of pandas for easy plotting. Also will not work in Azure Notebooks, due to the same IO restriction.
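
To make the layering above a bit more concrete, here is a minimal sketch of NumPy, pandas, and scikit-learn working together. It is not from any of the linked material: the synthetic data, the column names (x, y, half), and the choice of a linear regression are all assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data, made up for illustration: y = 3x + 2 plus a little noise.
rng = np.random.default_rng(seed=0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.size)

# pandas wraps the NumPy arrays in a labeled table (a DataFrame).
df = pd.DataFrame({"x": x, "y": y})

# Label each row by which half of the x range it falls in (hypothetical column).
df["half"] = np.where(df["x"] < 5.0, "low", "high")

# groupby: average y within each half.
group_means = df.groupby("half")["y"].mean()

# scikit-learn: fit a straight line to the same data.
model = LinearRegression()
model.fit(df[["x"]], df["y"])
slope = float(model.coef_[0])
intercept = float(model.intercept_)

print(group_means)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The fitted slope and intercept should land close to the 3 and 2 used to generate the data. NumPy produces the raw arrays, pandas adds labels and groupby on top of them, and scikit-learn accepts the DataFrame columns directly.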

[Image: python-overview]

Useful Python Tools for ML

These will allow you to get data science and machine learning tools running on your local machine, browser, or even the cloud. I use all three of these in my day-to-day work.

  • Azure Notebooks (Browser)
    • Do all of your ML work from within a Jupyter (Python) notebook inside the browser. No setup / install. You can have public / private notebooks. Great way to share your work. Free.
  • Azure Data Science Virtual Machine (VM)
    • You can run this as either a Windows or Linux (Ubuntu) VM. Click some buttons and you are good to go. Just make sure you turn auto-shutdown on!
    • The VM will have just about every tool you’d ever need to do data analysis or ML.

Udemy Courses

I get a lot of value from seeing how other people code, so I’ll watch these videos and code alongside them in an Azure Notebook. Typing in each line of code helps me remember. The courses are also very affordable, at around $12 each. You can find all of it here.

Machine Learning

Deep Learning

Books

Sometimes I prefer having a tangible item in my hand to highlight, take notes, and read on a plane. I’ve found these three books to be the most useful in my studies.

Azure Notebooks

I mentioned these above, but for me the real value lies in the samples and tutorials provided with them. The image below shows how I progressed through them, broken down by difficulty. If you are brand new to the field of data science, it would be a great place to start.

These are all available on the Azure Notebooks landing page.

Here is a .pdf of the order I would do them in, beginner -> advanced

My Azure Notebooks

 

[Image: machine-learning-algorithm-cheat-sheet]

Additional Resources

 

 
