Programming and Data Analytics Resources


This page lists the main resources I've used to learn Python and data analytics. For each resource I'll explain why I chose it and how it fits into my learning path. At the end of this page I've also included a list of the tools I've learned and use regularly.


Python basics

I've always thought that to really learn something it's most important to understand the fundamentals very well. So to learn the basics of Python I used a number of different resources that repeated much of the same material.


Hands-on Python 3 Tutorial, Dr. Andrew Harrington

There are a lot of basic Python tutorials online, and many are much better known than this one. But Dr. Harrington's style of introducing the material just clicked with me and it really is "hands-on" as the title suggests. This was the first resource that I went through from beginning to end to try to learn Python, and I loved it. I highly recommend this tutorial to any complete beginner looking to get started with Python.


Corey Schafer's Tutorials

I think these are the best Python video tutorials out there. Corey is an excellent teacher and his tutorials cover not only the basics, but also data analytics, visualization, and web development in Python. I used them at the beginning of my learning journey, and I’m still using them. I actually built the basic structure and layout of this website using code he introduced in his Flask tutorial series.


Automate the Boring Stuff with Python, Al Sweigart

This book, which is free to read online, is great because not only do you learn the basics of Python, but if you follow along and actually complete the projects, when you are done, you will have programs that you can actually use in your daily life and work. I am still using the multi-clipboard program I made to store text that I use regularly in emails and messaging. Hands-on learning really is the only way to go to learn how to code and this book is really great for that.


Think Python, Allen Downey

This book does start from the very beginning (the first chapter is called “What is a program?”) but I recommend tackling it after you have run through a couple of other beginner-level tutorials. The hands-on exercises get quite difficult and the last third or so of the book goes into some pretty advanced stuff, including object-oriented programming with classes and algorithm analysis. I actually stopped going through this book at the chapters on classes and came back to complete it later, once I was ready for that stuff. Although this book is more challenging, I highly recommend it. Allen Downey’s books (this one and, in particular, another that I will introduce below called Think Stats) have been essential parts of my Python and data analytics learning.


Learning Python, Mark Lutz

I’ve used this book more as a reference book than a learning course book. At nearly 2000 pages in length there is nothing that this book does not cover. It is my go-to resource whenever I want to get an in-depth explanation of how something works in Python. I actually did go through the sections on object-oriented programming using classes more systematically to learn that, and it helped me a lot. So it can be used as a learning course book too. I just wouldn’t try to tackle it all in one go!


Data Analytics

After learning the basics of Python I looked at all the different uses for it, and I decided that data analytics would be the most useful area for me to apply what I’d learned. I work in HR and I could see a lot of applications for Python in my day-to-day work, from automating data cleaning to analyzing people data to gain insights that could help us make better decisions.

When I first started learning data analytics in Python I didn’t have any experience working with data tools other than Excel, and honestly I had no idea just how much there is to learn. Through my learning in this area though I’ve come to find that I love working with data, and I plan to continue building my data skills, especially for the specific purpose of people analytics, into the foreseeable future.

I’d like to note though that I’ve made a decision to draw a line, which I do not plan to cross, in how far I go with my learning in this area. And on the other side of that line is machine learning. It’s possible for anyone with even amateur Python skills like I have to build machine learning models, but I don’t think this is a good idea. There is so much underlying knowledge needed to do machine learning well that I think it should be left to people who have many years of specialized training or advanced degrees. I’ve decided to focus on learning the fundamentals of good data analysis, which for me means expertise in using tools for analysis and visualization, combined with a strong understanding of statistics, probability, and distributions.


Coursera - Applied Data Science with Python Specialization, University of Michigan

This specialization is what I would call a data science “crash course” that goes from basic analysis all the way to machine learning, text mining, and network analysis. I’m not a fan of this kind of learning because I think it makes people believe they have learned a lot when really all they have is a very superficial understanding of things that actually require a lot of study and practice to truly understand and master. Nevertheless, I found the first two courses which covered basic analysis (numpy and pandas) and visualization (matplotlib and seaborn) libraries to be good introductions to the tools I would need to learn. I did not continue on to the third course on machine learning and don’t plan to do it.


Python for Data Analysis, Wes McKinney

For me, this book is the bible for learning data analysis in Python, especially pandas, since Wes McKinney is the creator of the library. I went through all the sections, and in the pandas section in particular I made sure to practice the methods carefully to understand them well. Although it works as a study book, it’s written in kind of a more reference book style, and I do still refer to it when I need a reminder on how something in pandas works, which is actually quite often.


Data School, Kevin Markham

Kevin’s video tutorials on pandas are really easy to follow and learn from. I used these videos in conjunction with Wes McKinney’s book, and this worked well for me because these video tutorials are done in a quite hands-on manner, at a relatively simple level, whereas the book gets more into the details on how to use pandas methods.


Think Stats, Allen Downey

Once I’d learned the basics of numpy and pandas, I realized that if I wanted to really be good at data analysis I had to build up a strong understanding of statistics. Since I was using Python in all my learning to this point, I picked up this book, and I can easily say that I’ve learned more from it than any other resource I’ve listed here. I literally spent the better part of a year going through it, reading and re-reading, and doing every exercise. The way Professor Downey teaches statistics through this book is unique. He teaches you how to use the power of computation (programming) to do statistical analysis rather than using formulas. I found this approach really helped me to intuitively understand the ideas.

In this book, Professor Downey uses custom modules of code for analysis and visualization that he built from basic Python data structures without using libraries like scipy.stats and statsmodels that obscure a lot of the inner workings of the code. This helps readers understand more deeply what the code is actually doing, which in turn shows how the statistical methods actually work. His approach inspired me to build my own Python package based on what I was learning, so I could have my own toolbox of functions and classes for my future data analytics work and not be dependent on the code from the book. The result of this is a package I built called datastats. This package has modules for working with both univariate and multivariate data, and for hypothesis testing and plotting.
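To give a feel for what "built from basic Python data structures" can look like, here is a minimal sketch of a probability mass function stored in a plain dictionary. It is not the code from Think Stats or from the datastats package; the function names `make_pmf`, `cdf_at`, and `pmf_mean` and the sample values are made up for illustration.

```python
from collections import Counter


def make_pmf(values):
    """Map each distinct value to its relative frequency (hypothetical helper)."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in sorted(counts.items())}


def cdf_at(pmf, x):
    """Probability that a value is less than or equal to x."""
    return sum(prob for value, prob in pmf.items() if value <= x)


def pmf_mean(pmf):
    """Mean of the distribution: each value weighted by its probability."""
    return sum(value * prob for value, prob in pmf.items())


# Made-up sample data, purely for illustration.
values = [1, 2, 2, 3, 3, 3]
pmf = make_pmf(values)
```

Because everything is a dictionary and a few loops, it is easy to step through each line and see how the probabilities are computed.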

To get the most out of this book, fork its GitHub repository and, while reading, go through the Jupyter notebooks for each chapter, practicing everything and doing all the exercises. You can find the work I did using the ThinkStats repository here.


Khan Academy Statistics

While going through Think Stats I also worked my way through all the statistics and probability tutorials on Khan Academy. I’d learned most of this stuff in my high school and university math classes but that was a very long time ago so I definitely needed to relearn everything.


Introduction to Statistics, Jim Frost

Jim Frost introduces statistical concepts in a very easy-to-understand way, using simple language and not a lot of formulas. I found this book helped me strengthen the foundation of what I was learning in statistics.


Hypothesis Testing, Jim Frost

By far, the part of statistics I spent the most time learning was hypothesis testing. I even built quite complex classes for doing many kinds of hypothesis tests in my own custom module. These classes use nonparametric computational methods that I learned through the Think Stats book, but I also wanted to understand the parametric methods that most statistics courses use. This book helped me learn these standard parametric methods and better understand how the two approaches differ.
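One common computational approach of this kind is a permutation test, sketched below with only the standard library. This is an illustration under stated assumptions, not the classes from my module or the code from Think Stats: the function name `permutation_test`, its parameters, and the two groups of numbers are all hypothetical.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, iterations=1000, seed=0):
    """Estimate a two-sided p-value for the difference in means by shuffling labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    count = 0
    for _ in range(iterations):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle the pooled data and split it again at the same point.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            count += 1

    return observed, count / iterations


# Made-up data: two clearly separated groups.
group_a = [10, 11, 12, 13, 14, 15, 16, 17]
group_b = [1, 2, 3, 4, 5, 6, 7, 8]
observed_diff, p_value = permutation_test(group_a, group_b)
```

The p-value here is simply the fraction of shuffles that produce a difference at least as large as the observed one, which is the step a parametric test replaces with a formula and a theoretical distribution.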


Introduction to Modern Statistics, Mine Çetinkaya-Rundel and Johanna Hardin

The Introduction to Modern Statistics (IMS) online book is an excellent free resource for anyone interested in learning statistical exploratory data analysis using modern techniques (i.e. programming). In the case of this book the programming language used is R. However, I'm accustomed to using Python, so when going through the lessons in this book, in addition to completing the exercise questions, I also used Python to recreate the analysis and visualizations in each chapter. I've found this to be really good practice for my programming skills and at the same time I can learn more about the analytical methods introduced in the book.

You can find all the work I've done while studying this book here.


Matplotlib Documentation Tutorials

When it comes to visualization using matplotlib, I learned almost entirely by just creating plots as part of my statistics learning. I didn’t really make use of any particular resources other than the documentation and tutorials on the matplotlib website and searching Stack Overflow for answers when I got stuck. One thing I think is really important to understand early on is the difference between matplotlib's two interfaces: the object-oriented interface and the pyplot interface. If you don’t get a handle on this from the start it’s easy to get quite confused. I personally prefer the object-oriented style for using matplotlib.
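As a rough sketch of the difference, the same line plot is drawn below with each interface. The data and titles are made up, and the `Agg` backend is selected only so the snippet runs without opening a window.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend: no display needed
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

# pyplot interface: each call acts on the implicit "current" figure and axes.
plt.figure()
plt.plot(x, y)
plt.title("pyplot interface")
plt.xlabel("x")
pyplot_ax = plt.gca()  # grab the current axes so it can be inspected later

# Object-oriented interface: keep explicit Figure and Axes objects
# and call methods on them directly.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("object-oriented interface")
ax.set_xlabel("x")
```

With the object-oriented style it is always clear which axes a call affects, which matters as soon as a figure has more than one subplot.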


Seaborn Documentation Tutorials

As with matplotlib, I really only used the website documentation and tutorials to learn Seaborn. If you’ve already learned matplotlib, I highly recommend spending some time to learn Seaborn next because it greatly simplifies a lot of the plotting that you commonly need to do in exploratory data analysis. And, since it is built on top of matplotlib, you can “drop down” into the underlying matplotlib elements of your charts to fine tune them to your liking.


Data Points, Nathan Yau

I found this book to be a good overview of how to use visualization effectively to analyze data and present it to an audience in different contexts.


Storytelling with Data, Cole Nussbaumer Knaflic

This book is the best one out there for learning how to communicate the results of an analysis to an audience effectively. The techniques are clearly explained with great examples. I always try to apply the lessons I learned from this book when I’m creating my charts and I feel like I’m slowly getting better at it. This is a must read for anyone who needs to communicate with data.


Core Programming Skills

I’ve only really dabbled in this area, but in the future, if I end up doing more large-scale data analytics and building bigger programs, I plan to spend more time learning about algorithms, data structures, and programming patterns to build up my core programming capabilities. I've only read parts of many of the books listed below but I've found them to be good resources.


A Common Sense Guide to Data Structures and Algorithms, Jay Wengrow

This book uses really simple language and practical examples to teach different types of algorithms and data structures. There are also short exercises to reinforce what you learn.


Applied Computational Thinking with Python, Sofia De Jesus

This book seems like a good one for building logical reasoning and problem-solving skills and learning how to apply them to building programs in Python.


Composing Programs, John DeNero

Just copied straight from the intro page: “In the tradition of SICP, this text focuses on methods for abstraction, programming paradigms, and techniques for managing the complexity of large programs. These concepts are illustrated primarily using the Python 3 programming language.”
* SICP is the Structure and Interpretation of Computer Programs (a.k.a. The Wizard Book)


The Pragmatic Programmer, David Thomas and Andrew Hunt

I read this book all the way through, save for one section on concurrency, which seemed to be quite far beyond anything I would be working on, and, although I'm very much an amateur programmer, I found it to be very helpful. Through this book I was able to better understand a number of best practices like Don't Repeat Yourself (DRY) and decoupling (how to write code that makes it easy to change one part without breaking other parts) that I can apply in my small-scale projects too. I also found the coding by "tracer bullets" concept introduced in the book especially interesting as it outlines a method of software development that aligns well with the lean practices I've been learning a lot about and plan to implement more in my work in the future.


The Tools I Use

In this section I'll list the programming and data analytics tools I've learned and use on a regular basis.


Python

Python is the only programming language I know, unless you count the limited HTML/CSS understanding that I've picked up along the way. I use Python mostly for data analytics, but I've also used it for web development (i.e. building this website) and to build small automation programs.

I still have a lot to learn to enhance my core Python skills, but so far I've built up a good understanding of data structures, flow control, functions, and classes. Additionally, I've also learned and use the following Python packages, libraries, and frameworks.

  • numpy: array computations
  • pandas: data analysis, table manipulation
  • scipy.stats: statistical functions and probability distributions
  • statsmodels: statistical modeling (e.g. regression analysis)
  • matplotlib: data visualization
  • seaborn: data visualization for exploratory data analysis
  • Flask: web development


Jupyter Notebooks

I learned how to use Jupyter notebooks early on, and I do most of my coding in notebooks because it makes iteration and testing of ideas easy. To be perfectly honest though, when I write my code in notebooks things do get messy, with a lot of trial and error code bits, comments, etc. strewn throughout. So my workflow is to use notebooks for the code development part of my work, and then, once I have things working, I move my code to a code editor to create clean, well-documented modules and programs.


Visual Studio Code

VSCode is the editor I have chosen to use, and, as mentioned above, I mostly use it at the end of my workflow after having developed working code using notebooks.


Git and GitHub

Once I started building more complex modules and programs I decided to learn how to use Git. At first I found using the command line difficult so I used the GitHub Desktop GUI, but I decided to abandon that and learned the basic git commands for the command line. I think learning the commands is the way to go. Although there is a lot I still don't know about Git and I've never used it to work on a collaborative project, I'm quite comfortable using it now to manage my projects. I'm also now sharing my projects on GitHub and I use it to store a bunch of code snippets that I use in my work.


Anaconda

I use Anaconda (actually Miniconda, which is more lightweight but still meets my needs) for installing Python and managing virtual environments.


StackEdit

I use StackEdit for any writing that I plan to share online because it's an in-browser markdown editor that also has a desktop application, so it's possible to write anywhere, anytime, both online and offline.


Modern csv

Modern csv makes working with csv files a breeze. In my case a lot of the data I work with is in Korean which necessitates utf-8 encoding. It's a huge pain to import and work with utf-8 encoded csv data in Excel, and this tool solves this problem.
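On the Python side, the same kind of data can be handled with the standard `csv` module by being explicit about the encoding. The sketch below uses a made-up file name and made-up rows; the `utf-8-sig` codec writes a byte order mark (BOM) at the start of the file, which Excel typically relies on to recognize a csv file as UTF-8.

```python
import csv
import os
import tempfile

# Made-up rows with Korean text, purely for illustration.
rows = [["이름", "부서"], ["김민준", "인사팀"]]

# Hypothetical file name in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "people.csv")

# Writing with "utf-8-sig" puts a BOM at the start of the file.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Reading with "utf-8-sig" strips the BOM again, so the first header is clean.
with open(path, encoding="utf-8-sig", newline="") as f:
    loaded = list(csv.reader(f))
```

Reading a BOM-marked file with plain `utf-8` would leave an invisible character glued to the first header, which is an easy mistake to miss.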