Plumbing Data Science Pipelines

July 27, 2017. Krishnapriya Satagopan
To supercharge data science teams and enable development of data products that react efficiently to signals from the wild, we need to build stable infrastructure that supports real-time, near real-time as well as batch processing.

A particular early challenge we faced here at Mad Street Den was building scalable, multi-tenant and cost-efficient real-time data pipelines that serve use cases like real-time sessionization of clickstreams. At the Fifth Elephant conference in Bangalore this week, we will be sharing some of our hard-earned insights from building out asynchronous data workflows based on Celery, our weapon of choice.

Data – There is a lot of it!
Data needs to be aggregated at every stage and made compatible for analysis & consumption.

The plumbing story involves three important phases:

  • Preparation: Ask questions, collect & organize data
  • Analysis: Aggregate data, find patterns & relationships, summarize
  • Application: Make decisions, share results and visualize

We will be looking at a simple logging use case to understand pipelines and plumbing with Celery, RabbitMQ, Redis and the ELK Stack architecture.

  • Poll SQS queue
  • Process the logs
  • Push to Elasticsearch and view on Kibana (ELK Stack)
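
The three steps above can be sketched in plain Python. This is only a minimal illustration, not our production code: the in-memory `queue` stands in for SQS, the `index` list stands in for Elasticsearch, and the function names `poll`, `process`, `push` and `run_once` are hypothetical. In a real pipeline each step would typically be its own Celery task.

```python
import json
from collections import deque

# Stand-in for an SQS queue (assumption): each message is a JSON log line.
queue = deque([
    json.dumps({"level": "info", "msg": "user logged in"}),
    "not valid json",
    json.dumps({"level": "error", "msg": "payment failed"}),
])

# Stand-in for an Elasticsearch index (assumption): a list of documents.
index = []


def poll(q, max_messages=10):
    """Take up to max_messages off the queue, like one polling call."""
    batch = []
    while q and len(batch) < max_messages:
        batch.append(q.popleft())
    return batch


def process(raw):
    """Parse one raw log message; return None if it is not a JSON object."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    record["level"] = str(record.get("level", "info")).upper()
    return record


def push(docs, sink):
    """Append the processed documents to the sink and report the count."""
    sink.extend(docs)
    return len(docs)


def run_once(q, sink):
    """One pass of the pipeline: poll, process, push."""
    parsed = (process(raw) for raw in poll(q))
    return push([doc for doc in parsed if doc is not None], sink)


indexed = run_once(queue, index)
print(indexed, [doc["level"] for doc in index])
```

Malformed messages are dropped here for brevity; a real pipeline would more likely route them to a dead-letter queue.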

Some nuggets on the ‘How’s and ‘What’s of pipelines:

  1. Creating asynchronous Celery tasks
  2. Exponential backoffs with Celery
  3. Rate limiting your application
  4. Persistence with Redis
  5. Autoscaling with Celery
  6. Monitoring with htop
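
To make two of these items concrete, here is a small library-free sketch of exponential backoff and rate limiting. It is an illustration under assumptions, not the Celery implementation: `backoff_delay` and `TokenBucket` are hypothetical names, and the base delay, cap and rates are made-up values. In Celery itself, similar behaviour is usually expressed through task retries with a `countdown` and the `rate_limit` task option.

```python
import random
import time


def backoff_delay(retries, base=1.0, cap=60.0, jitter=False):
    """Seconds to wait before retry number `retries` (0-based).

    The delay doubles on every retry and is capped at `cap`.
    """
    delay = min(cap, base * (2 ** retries))
    if jitter:
        # Spread retries out so that failed tasks do not retry in lockstep.
        delay = random.uniform(0, delay)
    return delay


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self):
        """Return True if a call may proceed now, consuming one token."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


delays = [backoff_delay(n) for n in range(8)]
print(delays)
```

The `clock` parameter exists only so the bucket can be driven by a fake clock in tests; callers would normally leave it at the default.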

Memory can be a real bottleneck while building such an application. Large objects in code, cyclic references and Redis sessions are some of the places worth profiling when handling memory issues.
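
As a rough illustration of the first two of these, the sketch below uses only the Python standard library: `tracemalloc` to measure how much a large object adds to traced memory, and `gc` to count objects that were kept alive only by reference cycles. The `Node` class, the payload size and the number of cycles are made up for the example; this is one possible way to look for such issues, not a description of our tooling.

```python
import gc
import tracemalloc


class Node:
    """Tiny object used to build a reference cycle."""

    def __init__(self):
        self.ref = None


def make_cycle():
    """Create two objects that point at each other, then drop them."""
    a, b = Node(), Node()
    a.ref, b.ref = b, a


# Disable automatic collection so the cycles stay around until we look.
gc.disable()
gc.collect()

# 1. Large objects: measure how much traced memory a payload adds.
tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
payload = [bytes(1024) for _ in range(1000)]  # roughly 1 MB of log-like data
after, _ = tracemalloc.get_traced_memory()
grown = after - before

# 2. Cyclic references: reference counting alone cannot free these,
#    so they pile up until the cycle collector runs.
for _ in range(100):
    make_cycle()
unreachable = gc.collect()  # number of unreachable objects found

tracemalloc.stop()
gc.enable()

print(f"payload grew traced memory by about {grown // 1024} KiB")
print(f"collector found {unreachable} unreachable objects")
```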

Lessons learnt

  • Task based – Memory | Compute – Separate Workers
  • Input Based – Producer Consumer throttling
  • The Rabbit is best kept away from the tasks
  • The small files problem (Streaming vs Batching)
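
The producer-consumer throttling lesson can be illustrated with a bounded queue: when the consumer falls behind, the producer blocks instead of filling memory. This is a minimal sketch using threads from the standard library; the buffer size, the item count and the squaring "work" are arbitrary assumptions, and in a Celery setup the same effect would come from broker and worker settings rather than from this code.

```python
import queue
import threading

BUFFER_SIZE = 2   # assumption: a small bound to make the throttling visible
SENTINEL = None   # marks the end of the stream

buffer = queue.Queue(maxsize=BUFFER_SIZE)
consumed = []


def producer(n):
    """Emit n items; put() blocks whenever the buffer is full."""
    for i in range(n):
        buffer.put(i)
    buffer.put(SENTINEL)


def consumer():
    """Drain the buffer until the sentinel arrives."""
    while True:
        item = buffer.get()
        if item is SENTINEL:
            break
        consumed.append(item * item)  # stand-in for real processing


threads = [
    threading.Thread(target=producer, args=(10,)),
    threading.Thread(target=consumer),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(consumed)
```

Because the queue never holds more than `BUFFER_SIZE` items, the producer's pace is tied to the consumer's, which is the point of throttling on input.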

Want to know more about this tech stack and how it helped in building ‘Real-Time’ Scalable Data Workflows at Mad Street Den?

Be sure to catch our talk at Fifth Elephant 2017, on the 28th of July!
