Ultralearning Data Science

published on October 20th, 2019

My last post was a summary of the book Ultralearning by Scott Young. As a learning addict, I am really excited by this book and want to try out the techniques in my own learning projects. After weighing a number of different subjects I've settled on learning data science. It so happens that I'm changing jobs towards the end of the year, which will give me a little bit of much-needed time. Overall I have approximately 2.5 months until I start at the new place.

Goals

Learning data science is quite vague - what do I actually mean by this? A better definition of my goal is the following:

Given a data modelling problem, I want to be able to come up with a reasonable solution and solve all aspects of the problem end to end.

This is more concrete and can be made even more precise by explaining what I mean with some of the terms:

data modelling problem: This could be any Kaggle challenge or similar problems of my own choosing.
reasonable solution: Something that is in the top 10-20% of the Kaggle submissions (tbh I don't know how hard this is - if it is really hard I would also be happy with a lower percentile).
solve all aspects of the problem end to end: Starting from an idea, I want to be able to design the overall approach (data needed, models, what the final solution should do), obtain data, clean it, select and try different models, optimize, and build a working solution from this.

Basically, I want to be able to approach problems in the data science space and come up with reasonable solutions that do the job, even if they are maybe not the absolute best.

A second goal is that I want to know and understand the most important models, techniques and tools. I want to be able to understand the solutions of others and have some intuition for why an approach works or doesn't work, what else could be tried, etc.

Non-Goals

2.5 months while having a job and a family isn't much, so there must also be some things that I will not learn. In particular, my goal is not to deeply understand the math behind most models. While I would like to acquire some background and understand why some methods work or do not work (which will certainly include some math), my goal is not to understand and/or be able to reproduce derivations or proofs. If I succeed in this challenge I might take this up later, but it's not something that I want to include right now.

Learning Approach and Materials

I am planning to mainly work with the following learning materials:

Primary Learning Materials

Kaggle challenges: A great source for actual problems to work on. It also gives the opportunity to learn from the solutions of others. There are also some introductory challenges which are targeted at beginners such as myself.
Fast AI: The hands-on courses by Jeremy Howard seem ideal for the type of learning I want to do. I want to complete the Introduction to Machine Learning for Coders and the two deep learning courses.
Personal challenges: I have some personal data modelling challenges that I would like to have solutions for. Explaining these is outside of the scope of this blog post, but one of my goals is to be able to solve these. I will get to them in the second half of the challenge, when I have hopefully acquired a bit more knowledge.

Background Material

In addition to the primary learning materials, I will use the following books to supplement my learning.

An Introduction to Statistical Learning: with Applications in R
Pattern Recognition and Machine Learning
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Deep Learning: Adaptive Computation and Machine Learning

The goal is not to read these books cover to cover, but to use them as references or background information in case I feel like I need some additional explanations or would like (and have the time) to go deeper in a certain topic.

My Background

I have a background in mathematics and finance and have worked as a programmer (and more recently CTO) for close to 10 years, so I won't need to learn programming or Unix tools. I am also familiar with probability theory and statistics and have dabbled with data science in the past (such as taking the well-known Stanford machine learning course by Andrew Ng on Coursera) but have never done anything real with it. This background certainly makes this challenge more approachable than for somebody truly starting "from scratch".

Material for Drills

I don't really know yet where my weaknesses will be and what parts I need to drill on, but I believe that working on existing Kaggle challenges, possibly copying the work of others and then focusing on or experimenting with a specific aspect, would make good drills.

Retrieval and Retention

To optimize learning, I want to practice free recall for the lectures I watch. So after watching a lecture, I will sit down and write down the things that I can recall from it. I want to use the Feynman technique (write a concept on a sheet of paper, then explain it in depth without looking things up) on concepts. Another approach that I would like to take is to explain or discuss the things I have learned to my colleagues.

To limit forgetting, I plan to write and review flash cards using Anki on concepts I want to remember. I have also used Anki when refreshing my algorithm knowledge, and it has been a tremendous help in committing things to long-term memory. By doing this I've realized that I was often hampered by having already forgotten things that I had previously learned.

Scheduling

Having a job and a family makes finding the time for this endeavour challenging - it often seems like finding the time is one of the hardest things. On weekdays I plan to:

work 45-60 minutes on this in the morning (getting up earlier)
spend around 30 minutes of my lunch break on learning
spend another 45-60 minutes in the evening

This gives between 2 and 2.5 hours per weekday. On weekends I will be able to put in 2-3 hours per day.

This comes out to about 15 hours per week. This regime will last for 6.5 weeks, which gives 6.5 * 15 = 97.5, so almost 100 hours.

There will be about 4 weeks where I will be off, and during this time I will be able to spend more time on learning, probably about 35 hours per week (I still have family commitments). This gives me approximately another 140 hours.

So all in all I can dedicate about 240 hours to this project - this is a little less than 1.5 months of full-time work.
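The estimate above can be reproduced in a few lines of Python. This is only a sketch: the function and variable names are illustrative, and the 40-hour full-time week is an assumption that the text does not state.

```python
# Sketch of the time budget described above; all names are illustrative.

def weekly_range(weekday_hours, weekend_hours, weekdays=5, weekend_days=2):
    """Return (low, high) hours per week from per-day (low, high) ranges."""
    low = weekday_hours[0] * weekdays + weekend_hours[0] * weekend_days
    high = weekday_hours[1] * weekdays + weekend_hours[1] * weekend_days
    return low, high


def total_hours(phases):
    """Sum hours over a list of (weeks, hours_per_week) phases."""
    return sum(weeks * hours for weeks, hours in phases)


# 2 to 2.5 hours per weekday, 2 to 3 hours per weekend day.
low, high = weekly_range((2.0, 2.5), (2.0, 3.0))

# 6.5 weeks at about 15 hours/week, then 4 weeks at about 35 hours/week.
budget = total_hours([(6.5, 15), (4, 35)])

FULL_TIME_WEEK = 40  # assumption: hours in one full-time work week

print(f"weekly range while working: {low} to {high} hours")
print(f"total budget: {budget} hours")
print(f"full-time equivalent: {budget / FULL_TIME_WEEK:.1f} weeks")
```

The weekly range works out to 14 to 18.5 hours, so the 15 hours per week used in the estimate sits near the conservative end, and the total of 237.5 hours rounds to the 240 quoted above.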

There's Never Enough Time

Again, it feels like with a job and family, there is never enough time to really learn anything, but 240 hours is certainly something. Anyway, I am very happy with my personal situation and would not change it for anything in this world, so in spite of some semi-regular whining about time, what I really want to do is make the most of every minute. I am a little embarrassed to admit that, more often than I like, I don't make the best of the little learning time that I do have and spend part of it browsing HN or losing time in similar ways. Doing better this time will be essential to make this project a success. The upside of having a busy schedule is that it forces you to be more efficient and more careful about what you spend your time on.

To be mindful of my time, and also to check whether I can really meet my estimates of the time I can spend, I will use a time tracking app to keep track of what I am doing and how much time I spend on each activity. Hopefully this will reinforce that I have a limited amount of time to dedicate to this project and improve my focus. It will also enable me to review how much time goes into each activity and re-adjust my approach based on the perceived value of each item. I will report on this in my weekly reports (see below).
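As a rough illustration of that kind of review, the sketch below sums logged minutes per activity and prints each activity's share of the total. The log format and the entries are hypothetical placeholders, not real tracking data, and an actual time tracking app will export something different.

```python
from collections import defaultdict

# Hypothetical log entries as (activity, minutes). These values are made up
# purely for illustration; they are not real tracking data.
entries = [
    ("lectures", 60),
    ("kaggle", 45),
    ("anki", 15),
    ("kaggle", 90),
    ("lectures", 30),
]


def minutes_per_activity(log):
    """Sum the logged minutes for each activity."""
    totals = defaultdict(int)
    for activity, minutes in log:
        totals[activity] += minutes
    return dict(totals)


totals = minutes_per_activity(entries)
grand_total = sum(totals.values())

# Print activities from most to least time, with their share of the total.
for activity, minutes in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{activity:10s} {minutes:4d} min  {minutes / grand_total:.0%}")
```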

Getting Feedback

A last thing that I think is really important is to get enough feedback during the learning process. I want to collect feedback as follows:

Kaggle results: How do I perform on the Kaggle challenges that I attempt?
Meta feedback: How do my Kaggle rankings develop? Am I getting better? How quickly (i.e. what's my learning rate)? If not why not?
Weekly review posts: I want to write a weekly post discussing my progress. I want to submit these posts to reddit (e.g. /r/datascience or /r/learnmachinelearning) and then get some feedback and advice via the comments. I don't know if they want me, but I will try.
Local meetups: Munich has two large data science meetups - I want to present this learning project there and hopefully find a mentor or two who can provide direct feedback on my results and on how I should adjust my schedule.

Plan for Week 1

To kick this off, here is my plan for the first week:

Getting a working setup up and running, probably a remote setup on a site like Paperspace.
Fully complete at least three Kaggle challenges, starting with the Titanic challenge and the Bluebook for Bulldozers challenge.
Fast AI lectures: Watch and take notes on the first two lessons from the machine learning course on fast.ai and the first lesson in the deep learning for coders course.
Write a blog post that reviews the results of the week and plans the next one. I want to report the time I have been able to spend, the concepts, models and insights I have learned, the Anki cards that I have created and my plan for the next week.

Taking the Leap

It does feel a little bit scary to commit to this challenge, especially with the plan to put myself out there by blogging about it, submitting the posts to the reddit crowd and going to local meetups to talk about it. But this level of commitment also makes this fun and exciting, and I think it makes it much more likely that I will actually learn a great deal instead of dabbling a little bit again before giving up. Stay tuned!