No, this will not be about ‘big data’.
It will be about the very existence of the data and its quality.
That matters a great deal, and you cannot overemphasize it.
Obviously, we have an understandable tendency to follow the trends.
Sometimes we want to skip a few steps and jump to the end, where things all seem to come together.
This sometimes means that the ‘evolution of our progress’ is somehow disturbed. And then the end product does not always look exactly as we want. It feels like we are there, and sometimes also as if we are far away. A mixed feeling, so to say.
Now what do I mean and how does this relate to data science?
‘Data science’ starts with ‘data’.
Sometimes you have a purpose and look for the data that will serve it.
Sometimes you already have the data and you look for ways to exploit it.
Bottom line: it all comes down to data, and unless you have ‘valuable data’, you will not be able to contribute.
And ‘valuable’ has a few meanings here.
One, why is this data ‘worthy’? What is its potential? How well will it serve a purpose?
Two, how good is your data? ‘Garbage in, garbage out’, so you had better be able to rely on your data before you even think about starting.
And also: how unique is it? Here it's not only about the physical data itself but also about the way you interpret it. Is it yours, or did you ‘kaggle’ it?
Why am I saying all this?
On your data science journey, I wish you would think not only about the last and fancy part (the algorithms, the methodology, and yes, that includes the machine learning stuff too) but also, and maybe with more emphasis, about what you really want to do and how to get the ‘valuable’ data that will do it. And also connect that ‘purpose’ with the ‘valuable data’ through an overall strategy, which is finally -and only then- implemented through a methodology.
And it does not finish there; you have to test it, perform sanity checks and refine each and every step, and this will probably be a continuous effort.
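To make ‘sanity checks’ a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the records, the field names (`id`, `title`, `price`) and the `sanity_report` helper are hypothetical, and which checks matter will depend on your own data.

```python
from collections import Counter

# Hypothetical scraped records; the field names are illustrative only.
records = [
    {"id": 1, "title": "Item A", "price": 12.5},
    {"id": 2, "title": "", "price": 8.0},
    {"id": 2, "title": "Item B", "price": 8.0},
    {"id": 3, "title": "Item C", "price": -4.0},
    {"id": 4, "title": "Item D", "price": None},
]


def sanity_report(rows, key="id", required=("title", "price")):
    """Count a few basic quality problems in a list of dict records."""
    # Keys that appear more than once point to duplicated records.
    key_counts = Counter(row.get(key) for row in rows)
    duplicate_keys = [k for k, n in key_counts.items() if n > 1]

    # A required field counts as missing when it is None or an empty string.
    missing = {
        field: sum(1 for row in rows if row.get(field) in (None, ""))
        for field in required
    }

    # A simple range check: a price, when present, should not be negative.
    negative_prices = sum(
        1
        for row in rows
        if isinstance(row.get("price"), (int, float)) and row["price"] < 0
    )

    return {
        "rows": len(rows),
        "duplicate_keys": duplicate_keys,
        "missing": missing,
        "negative_prices": negative_prices,
    }


report = sanity_report(records)
print(report)
```

Running a report like this after every refinement step is one way to notice early when the ‘garbage’ creeps in.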
And last but not least, you have to find ways to exploit your work. How can you benchmark your strategy against another scenario? How can you reuse your data (the initial raw data, the processed data that you created, and your results) to solve a different problem?
You are right, it is not painless :-). Easier said than done.
Let me invite you to
It is a real-life data science project in Python, utilizing Scrapy and Pandas. It includes all the steps from problem definition to the final exploitation of the outputs. So, a full back-end project. Detailed information and active promotions can be found here: My Udemy Courses
See you there.