August 17, 2018
Before we dive into the “How”, let’s turn our attention to the “Why”. We are living in a data-driven world. What makes companies valuable is the volume, uniqueness and quality of the data they have accumulated through years of service. The insights squeezed from that data give companies leverage over their competitors. At the same time, more people than ever before are online and consuming a plethora of online services. As a result, the volume of data has grown exponentially, and it will only continue to grow. Companies are in constant need of more qualified people who can work with these massive collections of data properly, solve real problems and help them continue to improve their products and services.
“Data scientist has ruled as one of the hottest jobs for years, proven by its third consecutive No.1 ranking on Glassdoor’s 50 Best Jobs in America list. This is due to the high demand (4,524 open jobs), the high salary ($110,000 median base salary) and high job satisfaction (4.2). Not only are tech companies scrambling to hire data scientists, but industries across the board, from health care to nonprofits to retail, are also searching for this talent.”
– Andrew Chamberlain, Chief Economist, Glassdoor
As of writing this article, according to Glassdoor reports, the average base salary of data scientists is a staggering $120,931/yr and the median base salary is $110,000/yr. At the end of the day, though, there are plenty of ways in which you can earn money. What is the bigger motivation? As a data scientist, you will be in a position to better understand the world and why people behave the way they do. You will be able to help countries shape policies, NGOs mitigate threats and companies make a fortune, and just maybe, in your spare time, predict the future!
I was going to say learn any programming language. But I know time is of the essence, and if there is one programming language you can take the time to learn, let it be Python. Why? Python is arguably the most popular programming language out there for its simplicity (readability) and usefulness. Its simple syntax makes it easier to comprehend. Some programming languages are overloaded with parentheses, brackets, braces, commas and colons, but Python is simpler in that respect and also eliminates redundancy. It’s very powerful, yet intuitive to use. In a previous article, I explained how to set up your computer to write and run Python scripts. Once you have a grasp of the basics of Python, you will need to devote your time to understanding the existing libraries out there. You will need to understand what you can do with them and how you can use the functions they provide in your code.
A guide to some Python libraries that you should be familiar with:
You will need to harness various concepts of Statistics and Mathematics in general to make sense of observations in the real world. Statistics is generally regarded as one of the pillars of Data Science. But since it’s such a vast field of study, it can get quite strenuous and even intimidating, especially if you do not know where to start. Luckily, there is a great playlist created by Siraj Raval on various concepts of mathematics needed for Machine Learning (it also applies to Data Science) called The Math of Intelligence. For starters, you will need a solid understanding of probability, statistical inference (hypothesis testing, p-values, confidence intervals), regression models and a basic understanding of correlation.
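To make two of these ideas concrete, here is a minimal sketch that computes a Pearson correlation and an approximate 95% confidence interval for a mean, using only Python's standard library. The data (hours studied and exam scores) is made up for illustration, the function names are my own, and the interval uses a normal approximation (z = 1.96), which is a simplification for such a small sample.

```python
import math
import statistics


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_confidence_interval(sample, z=1.96):
    """Approximate confidence interval for the mean (normal approximation)."""
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * std_err, mean + z * std_err


# Hypothetical data: hours studied and exam scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 73]

r = pearson_correlation(hours, scores)
low, high = mean_confidence_interval(scores)
print(f"correlation: {r:.3f}")
print(f"approx. 95% CI for the mean score: ({low:.1f}, {high:.1f})")
```

A correlation close to 1 says the two variables rise together; it does not say that one causes the other, which is where hypothesis testing and careful study design come in.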
“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.”
The world is full of data. All the companies that you have ever heard of store some sort of data from the services they provide. From Facebook to The New York Times, all these companies use databases, which are programs that store data and provide other functionality such as adding, modifying or querying that data. SQL (Structured Query Language) is a language designed entirely to interact with these databases. You will need to know SQL to add, modify or pull data from these databases quickly. One of the best resources out there for learning SQL is SQL Zoo. Khan Academy also offers a free course called Intro to SQL: Querying and managing data, which is a great place to start. You could also benefit from the hundreds of free SQL cheat sheets out there, since revisiting these queries routinely helps you remember them.
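As a small taste of adding, modifying and pulling data with SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The articles table and its rows are hypothetical and exist only for this example; a production database would be a separate server, but the SQL statements look much the same.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table, created only for this illustration.
cur.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, views INTEGER)"
)

# Add data: INSERT one row per (title, views) pair.
cur.executemany(
    "INSERT INTO articles (title, views) VALUES (?, ?)",
    [("Intro to SQL", 120), ("Python basics", 340), ("Statistics 101", 95)],
)

# Modify data: UPDATE the rows that match a condition.
cur.execute(
    "UPDATE articles SET views = views + 10 WHERE title = ?",
    ("Intro to SQL",),
)

# Pull data: SELECT only the rows and columns you need, sorted by views.
cur.execute(
    "SELECT title, views FROM articles WHERE views > ? ORDER BY views DESC",
    (100,),
)
popular = cur.fetchall()
print(popular)

conn.close()
```

The question marks are placeholders: passing values separately instead of pasting them into the query string is the usual way to avoid SQL injection.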
Algorithms are as important to computer programs as recipes are to cooking. An algorithm gives you a set of instructions to follow, a guide if you will, for solving a problem efficiently. There are plenty of algorithms, each with its own distinctive way of approaching a situation. You will need to study them and understand when to use which algorithm, depending on the circumstances of the problem you are trying to solve. Machine learning algorithms can be categorized into 3 fundamental kinds: Supervised Algorithms, Unsupervised Algorithms and Reinforcement Algorithms. Here’s a guide to some algorithms that every data scientist should know: Linear Regression, Logistic Regression, Naive Bayes, K-Nearest Neighbor, Support Vector Machines, Decision Tree, Random Forest. Once you have grasped the concepts, it is extremely important to implement them yourself to really understand how they work. There’s a really cool GitHub repository of minimal and clean implementations of machine learning algorithms. You can fork the repo and run the code on your computer.
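To show what implementing one yourself can look like, here is a minimal sketch of K-Nearest Neighbor, a supervised algorithm, in plain Python. It is not taken from the repository mentioned above; the function name, the toy points and the "small"/"large" labels are all assumptions made for this illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest points."""
    # Pair every training point's Euclidean distance to the query with its
    # label, then sort so the closest points come first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated clusters of 2D points.
train_points = [
    (1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
    (6.0, 6.5), (7.0, 6.0), (6.5, 7.0),
]
train_labels = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(train_points, train_labels, (1.2, 1.4)))
print(knn_predict(train_points, train_labels, (6.4, 6.2)))
```

It is "supervised" because the training points come with known labels. Production libraries add faster neighbor search and more careful tie-breaking, but the core idea fits in a dozen lines.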
You need to build the capacity to communicate your results. You need to know how to describe your results well, what the possible explanations for them are, and what the best way to present them is. No matter how good your core analysis is, if you fail to communicate your results to others, or to present your insights in the most comprehensible way, your analysis will be undermined. In order to present your results well, you need to know how to use the different data visualization libraries in Python. Additionally, you might find yourself at an advantage if you know how to work with tools like Tableau. People find it easier and more convenient to gain insights from visuals than to rummage through huge amounts of raw data.
I cannot emphasize enough how necessary it is to become a part of the community. Many people feel that it’s a solo journey all the way, which couldn’t be further from the truth. Being a part of the community will not only help you absorb the collective knowledge of people around the world but also turn the sails of your ship toward where the world is currently heading. It’s hard to fall behind when you are an active participant in public discussions and forums. You should definitely join and explore GitHub regularly, which has over 30 million repositories and over 12 million users. Many companies, big and small, open-source a lot of their resources, which you can use for free and contribute to. You can see what people worldwide are currently working on and the level of sophistication of code that is required.
This article is by no means a complete list of all the skills required to be a really good data scientist. I did not even include the names of books you should read, and there are plenty of areas I have deliberately skipped. The purpose of the article is to give you a broad sense of the kind of skills that are expected from an individual if (s)he decides to pursue data science. The world changes fast and every day some things become irrelevant. Libraries become deprecated and newer, improved libraries are introduced. It is futile to fight the passage of time, because nothing can withstand it and nothing will.
© Amitabha Dey. All rights reserved.