Data version control is an emerging practice that lets teams iterate on machine learning models faster while still tracking the changes in data and models.
A machine learning project lifecycle differs from a conventional software lifecycle: ordinary software depends relatively little on data, whereas every machine learning model depends on its underlying data, and the model behaves differently when that data changes.
In simple terms: data changes → ML code needs recalibration → model changes.
So there is a clear need to track not just the code but also the data used to build each model.
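As an illustration, tracking a dataset alongside code might look like the following sketch. It assumes DVC and Git are installed, and `data.csv` is a hypothetical data file:

```shell
# Sketch: version a dataset alongside code with Git + DVC.
git init
dvc init
dvc add data.csv            # hypothetical data file; creates data.csv.dvc
git add data.csv.dvc .gitignore
git commit -m "Track data.csv with DVC"

# Later, after the data changes:
dvc add data.csv            # records the new version of the data
git commit -am "Update data.csv"
```

Git versions the small `.dvc` pointer file while DVC stores the data itself, so every commit pins both the code and the exact data it was built from.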
In this article, we will…
I’ve been in the data science field for more than 6 years and have tried and tested different tools, from programming in the terminal to text editors and cloud platforms. I’ve used both Python and R, but for the past few years I’ve worked only in Python.
In this article, I’ll write about
Python is a general-purpose programming language that can be used for many tasks: web scraping, automation, building websites and APIs, and, of course, machine learning models. …
PySpark is the Python API for Apache Spark, a distributed (cluster) computing framework for big data analysis written in Scala.
Today we will look at how we can build a Multi-layer Perceptron Classifier (Neural Net) on the Iris dataset, including data preprocessing and evaluation.
If I'm working on a local computer, I use the findspark package to locate Spark and then establish a SparkSession, as below:
We often run into a problem when working on multiple projects on a local system: each project may need a different Python version and its own set of package dependencies.
After stumbling over this problem, I found a neat solution using two excellent tools: Pyenv and Pipenv.
Pyenv manages Python versions, while Pipenv creates a virtual environment for each project and manages that project's Python packages and their dependencies.
This is a great way of working on different projects…
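A typical workflow with the two tools might look like the following sketch. It assumes pyenv and pipenv are already installed; the version number, directory, and package are illustrative examples:

```shell
# Sketch: per-project Python version + isolated dependencies.
pyenv install 3.10.13           # install a specific Python version
cd my-project                   # hypothetical project directory
pyenv local 3.10.13             # pin this directory to that version
pipenv install --python 3.10    # create a virtualenv for the project
pipenv install pandas           # add a dependency (recorded in the Pipfile)
pipenv shell                    # activate the project's environment
```

Each project gets its own pinned interpreter and its own `Pipfile`/`Pipfile.lock`, so dependencies never bleed across projects.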
Data Scientist | Machine Learning Engineer | Data & Analytics Manager