Continuous Integration and Continuous Deployment (CI/CD) is one of the core practices in MLOps (the other being Data Version Control), and we will look at how to use Google App Engine and GitHub Actions to achieve it.

Introduction

As part of a personal project, I was building a dashboard in Data Studio with data from BigQuery, which is populated by Python code hosted on App Engine. Some of the raw data files, config, and ML models are saved in Cloud Storage and tracked with Data Version Control (DVC).

Below is my earlier post, which explains Data Version Control in detail…


Data Version Control is an emerging practice that enables faster machine learning iterations while still tracking changes to the data and models.

Introduction

A machine learning project lifecycle differs from a normal software lifecycle. Regular software does not depend much on data, whereas in machine learning each model depends on its underlying data, and the model behaves differently when that data changes.

In simple terms: data changes → ML code needs recalibration → model changes

So there is an absolute need to track not just the code but also the data used to build the model.

Welcome to Data Version Control
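To make that concrete, here is a minimal sketch (not from the original post) of reading a DVC-tracked data file at a specific version through DVC's Python API; the file path, repository URL, and revision tag are placeholders.

```python
# Minimal sketch: read a specific version of a DVC-tracked file.
# The path, repo URL, and rev tag below are placeholders, not from the post.
import dvc.api
import pandas as pd

with dvc.api.open(
    "data/raw/train.csv",                        # file tracked by DVC
    repo="https://github.com/<user>/<project>",  # Git repo holding the .dvc metadata
    rev="v1.0",                                  # Git tag/commit pinning the data version
) as f:
    df = pd.read_csv(f)

# The same call with rev="v2.0" would load the newer data snapshot,
# so a model can be rebuilt against exactly the data version you choose.
```

The point is that a Git revision pins both the code and the exact data snapshot used to build a model.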

In this article, we will…


Why I chose Python and Visual Studio Code for my data science projects…

Photo by Campaign Creators on Unsplash

I’ve been in the data science field for more than 6 years and have tried and tested different tools, from programming in the terminal to text editors and cloud platforms. I’ve used both Python and R, but for the past few years I have worked in Python only.

In this article, I’ll write about

  • Why I prefer Python over R
  • My preferred text editor
  • And other tools I use

Why I prefer Python over R

Usability

Python is a general-purpose programming language and can be used for almost anything: web scraping, automation, building websites, building APIs, and of course machine learning models. …


PySpark’s MLlib has all the essential machine learning algorithms, and the multi-layer perceptron, which is simply a feed-forward neural network, is one of them.

A high-level diagram explaining input, hidden, and output layers in multi-layer perceptron.

Introduction

PySpark is the Python API for Apache Spark. Apache Spark is a distributed (cluster) computing framework for big data analysis, written in Scala.

Today we will look at how to build a Multi-layer Perceptron Classifier (a neural network) on the Iris dataset, including data preprocessing and evaluation.

Pre-requisites:

  1. You have PySpark available, either on a local computer, in Google Colab, or in Databricks.
  2. The Iris dataset, which I downloaded from Kaggle (https://www.kaggle.com/uciml/iris). It’s the UCI ML dataset that I’ll be using in this article.

If using a local computer, I use a helper package and establish a SparkSession, as below:
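The original code block is not reproduced here, so below is a minimal sketch of what the full pipeline could look like: local Spark setup, preprocessing, the MLP classifier, and evaluation. It assumes the findspark package for the local setup and the column names from the Kaggle Iris CSV (SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species); adjust both to your environment.

```python
# Minimal sketch; findspark and the Kaggle CSV column names are assumptions.
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("iris-mlp").getOrCreate()

# Load the Iris CSV downloaded from Kaggle
df = spark.read.csv("Iris.csv", header=True, inferSchema=True)

# Encode the string label and assemble the four measurements into a feature vector
df = StringIndexer(inputCol="Species", outputCol="label").fit(df).transform(df)
df = VectorAssembler(
    inputCols=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
    outputCol="features",
).transform(df)

train, test = df.randomSplit([0.8, 0.2], seed=42)

# 4 input features -> two hidden layers of 8 units -> 3 output classes
mlp = MultilayerPerceptronClassifier(layers=[4, 8, 8, 3], maxIter=100, seed=42)
model = mlp.fit(train)

predictions = model.transform(test)
accuracy = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(predictions)
print(f"Test accuracy: {accuracy:.3f}")
```

The layers parameter must start with the number of input features (4) and end with the number of classes (3); the hidden layer sizes in between are free to tune.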


Pyenv and Pipenv are essential tools if you work on multiple projects that need to be deployed to production while keeping a clean codebase.

A high-level overview of how Pyenv and Pipenv differ and how together they solve a bigger problem.

We often run into problems when working on different projects on a local system:

  1. we might need different Python versions for different projects (less common), or
  2. we might need Python packages compatible with particular versions (more likely), and
  3. we need a separate virtual environment per project for easy deployments.

After stumbling over this problem, I found a perfect solution using two awesome tools: Pyenv and Pipenv.

Pyenv manages Python versions, while Pipenv creates a virtual environment for each project and manages that project’s Python packages and their dependencies.

This is a great way of working on different projects…

Avinash Kanumuru

Data Scientist | Machine Learning Engineer | Data & Analytics Manager
