I recently started a Data Science bootcamp. It’s an intense education program that trains its students in the skills needed to become a data scientist. I am going to blog about my experiences with it, as well as any insights that might be helpful for others in similar programs.
I guess I should start by sharing a little info about myself. I have been in software and technology services sales for many years. I enjoy it because not only do I like technology, but it is great to see how companies can use tech to improve their business processes, becoming more productive, saving money, and improving their profits. I have been a bit of a nerd since I was a kid. I taught myself some basic programming languages as a teenager and have continued to program for fun (as well as for work) as an adult. I have also taken some college coursework: a C course at a local college (DePaul) as well as a certificate in Web Development.
One of the reasons I selected this data science program is that I want to be even more technical in my day-to-day work. Plus, I am fascinated with data and all that we can do with it, especially predictive analytics. The program has several modules where the students learn different aspects of data science. I just completed the first one. This module covered a range of topics, including:
- Learning GitHub and Bash Shell.
- The Python language, starting with basics like conditionals, lists, dictionaries, loops, and functions, then Pandas and NumPy, before moving on to more advanced programming.
- Data visualization with Matplotlib & Seaborn.
- Stats and statistical analysis (and all the math that comes with it!). Some of these topics: correlation & covariance, linear regression, ordinary least squares, R-squared, significance and p-values, multicollinearity, and multiple regression.
- An introduction to some machine learning methodologies.
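To make a couple of the stats topics above concrete, here is a minimal sketch (using made-up data, not anything from the course) of covariance, correlation, and a simple least-squares fit with NumPy:

```python
import numpy as np

# Made-up data: y depends linearly on x, plus some noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=1.0, size=200)

cov = np.cov(x, y)[0, 1]          # covariance of x and y
corr = np.corrcoef(x, y)[0, 1]    # Pearson correlation coefficient

# Simple linear regression via least squares (degree-1 polynomial fit)
slope, intercept = np.polyfit(x, y, 1)

# For simple (one-feature) linear regression, R-squared is just
# the squared correlation between x and y
r_squared = corr ** 2
```

With noisy data like this, the fitted slope lands close to the true value of 2.0, and R-squared tells you how much of the variation in y the line explains.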
I have found some great resources to assist my learning and coding. As always, Khan Academy is great. Another education site, “Math is Fun,” seems to be designed for elementary and junior high students, but it’s great for quick and easy look-ups of definitions and math formulas. Of course, there’s always a bunch of lectures available on YouTube. Be sure to check out Corey Schafer’s Python tutorials. He is absolutely amazing. Chris Albon has a website that is great for providing shortcuts for code and its syntax: https://chrisalbon.com. If you are considering a course like this one, I would recommend a short book, “Learn Python the Hard Way,” before you start the program. It is a code-along book that helped me to think like a developer by learning the basic skills of Python. Oh, and the Medium website/blogs have great articles about Python and Machine Learning. Once you have read a few topics there, it will push other relevant articles of interest to your email inbox. I’ll be sure to share more resources in my future blogs.
At the completion of our first module, we had to do a project. It actually was a lot of fun, and I learned a lot from it. It was a quasi-real-world project looking at how factors in the housing market affect the prices of homes.

After importing the data, we started with a fairly comprehensive data cleansing phase. Not only did we limit null values, we also eliminated data features with multicollinearity. I just learned about that: it essentially means eliminating (as much as possible) data features that produce the same type of results, so that you look only at features that independently affect your dependent variable. There’s more to it, but that’s a good starting definition (at least for me). Next, we explored the data to prep for the models that we wanted to deploy. I looked at the features that affected housing prices the most to figure out which ones were the best options.

The next phase of the methodology was to “run” a few machine learning models. A nice method for figuring out the best features is Recursive Feature Elimination (RFE). A Python library, Sklearn, has a prebuilt method that trains on the data, essentially “running” feature elimination to rank the data features. Once it’s completed, you will have the features ranked from best to worst. From there, I looked at p-values and R-squared calculations to check the validity of the model. There is a Python library, Statsmodels, that you can use to get those values as well as other tools for checking validity. It’s all done with just a few lines of code.

We finally outlined our interpretations and gave recommendations. This project was a full week of work for a beginning Python programmer. I could do that same work now in half the time, depending on the data.
I have made it past the first module of this program. Several more modules to go. It’s exciting and daunting at the same time. I will keep you posted on my progress and some of the “stuff” that I learn from it.