To supercharge data science teams and enable data products that react efficiently to signals from the wild, we need to build stable infrastructure that supports real-time, near-real-time and batch processing.
A particular early challenge we faced here at Mad Street Den was building scalable, multi-tenant and cost-efficient real-time data pipelines that serve use cases like real-time sessionization of clickstreams. At the Fifth Elephant conference in Bangalore this week, we will be sharing some of our hard-earned insights on building asynchronous data workflows with Celery, our weapon of choice.
Data – There is a lot of it!
Data needs to be aggregated at every stage and made ready for analysis and consumption.
The plumbing story involves these important phases:
Preparation: ask questions, collect and organize data
Application: make decisions, share results and visualize
We will look at a simple logging use case to understand pipelines and plumbing with Celery, RabbitMQ, Redis and the ELK stack; a rough code sketch of the flow follows the three steps below.
Poll SQS queue
Process the logs
Push to Elasticsearch and view on Kibana (ELK stack)
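To make those three steps concrete, here is a minimal sketch (not our production code) of a Celery chain that polls SQS via boto3, normalizes the log records, and indexes them into Elasticsearch for Kibana. The queue URL, index name and field names are placeholders.

```python
# Minimal sketch of the logging pipeline: poll SQS -> process -> index in Elasticsearch.
# Broker/backend URLs, the queue URL and the index name are illustrative placeholders.
import json

import boto3
from celery import Celery, chain
from elasticsearch import Elasticsearch

app = Celery("logpipe", broker="amqp://guest@localhost//", backend="redis://localhost:6379/0")

sqs = boto3.client("sqs", region_name="us-east-1")
es = Elasticsearch(["http://localhost:9200"])
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/log-events"  # placeholder


@app.task
def poll_sqs(max_messages=10):
    """Pull a batch of raw log messages off the SQS queue."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=max_messages, WaitTimeSeconds=10
    )
    messages = resp.get("Messages", [])
    # Delete after reading so the same batch is not picked up again.
    for m in messages:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])
    return [m["Body"] for m in messages]


@app.task
def process_logs(raw_bodies):
    """Parse and normalize each raw log line into a flat document."""
    docs = []
    for body in raw_bodies:
        event = json.loads(body)
        docs.append({"user": event.get("user_id"), "action": event.get("action"),
                     "ts": event.get("timestamp")})
    return docs


@app.task
def push_to_es(docs):
    """Index the processed documents so they show up in Kibana."""
    for doc in docs:
        es.index(index="clickstream-logs", body=doc)
    return len(docs)


# Wire the three steps into an asynchronous pipeline.
pipeline = chain(poll_sqs.s(), process_logs.s(), push_to_es.s())
# pipeline.delay()  # trigger periodically, e.g. from celery beat
```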
Some nuggets on the hows and whats of building pipelines (a code sketch combining several of these follows the list):
Creating asynchronous Celery tasks
Exponential backoffs with Celery
Rate limiting your application
Persistence with Redis
Autoscaling with Celery
Monitoring with htop
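Several of these nuggets can be shown in a single task. The sketch below, with illustrative limits and key names, defines an asynchronous Celery task that is rate limited, retries with an exponential backoff, and persists in-flight session state in Redis.

```python
# Sketch: one Celery task that is asynchronous, rate limited, retried with an
# exponential backoff, and persists session state in Redis. Limits, key names
# and the 100/m rate are illustrative, not recommendations.
import json

import redis
from celery import Celery

app = Celery("workers", broker="amqp://guest@localhost//")

# Redis doubles as a persistence layer for in-flight session state.
store = redis.StrictRedis(host="localhost", port=6379, db=1)


@app.task(bind=True, max_retries=5, rate_limit="100/m")  # at most 100 calls per minute per worker
def sessionize(self, user_id, event):
    try:
        key = "session:%s" % user_id
        # Persist each event against the user's session so a worker crash
        # does not lose the partially built session.
        store.rpush(key, json.dumps(event))
        store.expire(key, 1800)  # 30-minute session window
    except redis.ConnectionError as exc:
        # Exponential backoff: wait 1, 2, 4, 8, 16 seconds across the five retries.
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)
```

Autoscaling refers to the worker-level option (for example, celery -A workers worker --autoscale=10,2 grows and shrinks the pool between 2 and 10 processes), and htop is simply the process monitor we keep open alongside the workers to watch CPU and memory.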
Memory can be a real bottleneck while building such an application. Profiling for large objects held in code, cyclic references and stale Redis sessions is how we track down and handle memory issues.
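As one example of that kind of profiling (using Python's built-in tracemalloc and gc modules here; substitute whichever profiler you prefer), you can snapshot allocations around a batch of work to spot oversized objects and uncollected cycles:

```python
# Sketch: profile a batch of work for large allocations and cyclic garbage
# using Python's built-in tracemalloc and gc modules.
import gc
import tracemalloc


def run_one_task_batch():
    # Placeholder for the real work being profiled (e.g. one sessionization batch).
    return [{"user": i, "events": list(range(100))} for i in range(1000)]


tracemalloc.start()
batch = run_one_task_batch()
snapshot = tracemalloc.take_snapshot()

# Top allocation sites, largest first: oversized objects show up here.
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)

# With DEBUG_SAVEALL, objects the cycle collector finds unreachable are kept
# in gc.garbage for inspection instead of being freed.
gc.set_debug(gc.DEBUG_SAVEALL)
gc.collect()
print("objects held by reference cycles:", len(gc.garbage))
```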
Lessons learnt
Task based: run memory-bound and compute-bound tasks on separate workers (see the sketch after this list)
Input based: throttle producers and consumers so one cannot overwhelm the other
The Rabbit is best kept away from the tasks
The small files problem (streaming vs. batching)
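The first two lessons map fairly directly onto Celery configuration. A minimal sketch, with placeholder queue and task names, routes memory-bound and compute-bound tasks to separate queues served by separate workers, and throttles how much each consumer prefetches:

```python
# Sketch: route memory-bound and compute-bound tasks to separate queues served
# by separate workers, and throttle how much each consumer prefetches.
# Queue names, task names and limits are placeholders.
from celery import Celery

app = Celery("workers", broker="amqp://guest@localhost//")

app.conf.task_routes = {
    "tasks.load_clickstream": {"queue": "memory"},   # memory-bound work
    "tasks.score_session": {"queue": "compute"},     # compute-bound work
}

# Producer/consumer throttling: each worker process prefetches one task at a
# time and only acks after it finishes, so slow consumers do not hoard work.
app.conf.worker_prefetch_multiplier = 1
app.conf.task_acks_late = True

# Run dedicated workers per queue, for example:
#   celery -A workers worker -Q memory  --concurrency=2
#   celery -A workers worker -Q compute --concurrency=8
```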
Want to know more about this tech stack and how it helped us build real-time, scalable data workflows at Mad Street Den?