To supercharge data science teams and enable data products that react efficiently to signals from the wild, we need to build stable infrastructure that supports real-time, near-real-time and batch processing.
A particular early challenge we faced here at Mad Street Den was building scalable, multi-tenant and cost-efficient real-time data pipelines that serve use cases like real-time sessionization of clickstreams. At the Fifth Elephant conference in Bangalore this week, we will be sharing some of our hard-earned insights on building asynchronous data workflows with Celery, our weapon of choice.
Data – There is a lot of it!
Data needs to be aggregated at every stage and made ready for analysis and consumption.
The plumbing story involves these important phases:
Preparation: ask questions, collect and organize data
Application: make decisions, share results and visualize
We will look at a simple logging use case to understand pipelines and plumbing with Celery, RabbitMQ, Redis and the ELK stack; a rough code sketch of the flow follows the three steps below.
Poll SQS queue
Process the logs
Push to Elasticsearch and view on Kibana (ELK stack)
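To make those three steps concrete, here is a minimal sketch (not our production code) of a Celery chain that polls SQS via boto3, normalizes the log records, and indexes them into Elasticsearch for Kibana. The queue URL, index name and field names are placeholders.

```python
# Minimal sketch of the logging pipeline: poll SQS -> process -> index in Elasticsearch.
# Broker/backend URLs, the queue URL and the index name are illustrative placeholders.
import json

import boto3
from celery import Celery, chain
from elasticsearch import Elasticsearch

app = Celery("logpipe", broker="amqp://guest@localhost//", backend="redis://localhost:6379/0")

sqs = boto3.client("sqs", region_name="us-east-1")
es = Elasticsearch(["http://localhost:9200"])
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/log-events"  # placeholder


@app.task
def poll_sqs(max_messages=10):
    """Pull a batch of raw log messages off the SQS queue."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=max_messages, WaitTimeSeconds=10
    )
    messages = resp.get("Messages", [])
    # Delete after reading so the same batch is not picked up again.
    for m in messages:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])
    return [m["Body"] for m in messages]


@app.task
def process_logs(raw_bodies):
    """Parse and normalize each raw log line into a flat document."""
    docs = []
    for body in raw_bodies:
        event = json.loads(body)
        docs.append({"user": event.get("user_id"), "action": event.get("action"),
                     "ts": event.get("timestamp")})
    return docs


@app.task
def push_to_es(docs):
    """Index the processed documents so they show up in Kibana."""
    for doc in docs:
        es.index(index="clickstream-logs", body=doc)
    return len(docs)


# Wire the three steps into an asynchronous pipeline.
pipeline = chain(poll_sqs.s(), process_logs.s(), push_to_es.s())
# pipeline.delay()  # trigger periodically, e.g. from celery beat
```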
Some nuggets on the hows and whats of building pipelines (a code sketch combining several of these follows the list):
Creating asynchronous Celery tasks
Exponential backoffs with Celery
Rate limiting your application
Persistence with Redis
Autoscaling with Celery
Monitoring with htop
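Several of these nuggets can be shown in a single task. The sketch below, with illustrative limits and key names, defines an asynchronous Celery task that is rate limited, retries with an exponential backoff, and persists in-flight session state in Redis.

```python
# Sketch: one Celery task that is asynchronous, rate limited, retried with an
# exponential backoff, and persists session state in Redis. Limits, key names
# and the 100/m rate are illustrative, not recommendations.
import json

import redis
from celery import Celery

app = Celery("workers", broker="amqp://guest@localhost//")

# Redis doubles as a persistence layer for in-flight session state.
store = redis.StrictRedis(host="localhost", port=6379, db=1)


@app.task(bind=True, max_retries=5, rate_limit="100/m")  # at most 100 calls per minute per worker
def sessionize(self, user_id, event):
    try:
        key = "session:%s" % user_id
        # Persist each event against the user's session so a worker crash
        # does not lose the partially built session.
        store.rpush(key, json.dumps(event))
        store.expire(key, 1800)  # 30-minute session window
    except redis.ConnectionError as exc:
        # Exponential backoff: wait 1, 2, 4, 8, 16 seconds across the five retries.
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)
```

Autoscaling refers to the worker-level option (for example, celery -A workers worker --autoscale=10,2 grows and shrinks the pool between 2 and 10 processes), and htop is simply the process monitor we keep open alongside the workers to watch CPU and memory.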
Memory can be a real bottleneck while building such an application. Profiling for large objects held in code, cyclic references and stale Redis sessions is how we track down and handle memory issues.
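As one example of that kind of profiling (using Python's built-in tracemalloc and gc modules here; substitute whichever profiler you prefer), you can snapshot allocations around a batch of work to spot oversized objects and uncollected cycles:

```python
# Sketch: profile a batch of work for large allocations and cyclic garbage
# using Python's built-in tracemalloc and gc modules.
import gc
import tracemalloc


def run_one_task_batch():
    # Placeholder for the real work being profiled (e.g. one sessionization batch).
    return [{"user": i, "events": list(range(100))} for i in range(1000)]


tracemalloc.start()
batch = run_one_task_batch()
snapshot = tracemalloc.take_snapshot()

# Top allocation sites, largest first: oversized objects show up here.
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)

# With DEBUG_SAVEALL, objects the cycle collector finds unreachable are kept
# in gc.garbage for inspection instead of being freed.
gc.set_debug(gc.DEBUG_SAVEALL)
gc.collect()
print("objects held by reference cycles:", len(gc.garbage))
```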
Lessons learnt
Task based: run memory-bound and compute-bound tasks on separate workers (see the sketch after this list)
Input based: throttle producers and consumers so one cannot overwhelm the other
The Rabbit is best kept away from the tasks
The small files problem (streaming vs. batching)
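The first two lessons map fairly directly onto Celery configuration. A minimal sketch, with placeholder queue and task names, routes memory-bound and compute-bound tasks to separate queues served by separate workers, and throttles how much each consumer prefetches:

```python
# Sketch: route memory-bound and compute-bound tasks to separate queues served
# by separate workers, and throttle how much each consumer prefetches.
# Queue names, task names and limits are placeholders.
from celery import Celery

app = Celery("workers", broker="amqp://guest@localhost//")

app.conf.task_routes = {
    "tasks.load_clickstream": {"queue": "memory"},   # memory-bound work
    "tasks.score_session": {"queue": "compute"},     # compute-bound work
}

# Producer/consumer throttling: each worker process prefetches one task at a
# time and only acks after it finishes, so slow consumers do not hoard work.
app.conf.worker_prefetch_multiplier = 1
app.conf.task_acks_late = True

# Run dedicated workers per queue, for example:
#   celery -A workers worker -Q memory  --concurrency=2
#   celery -A workers worker -Q compute --concurrency=8
```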
Want to know more about this tech stack and how it helped us build real-time, scalable data workflows at Mad Street Den?