🎵
True as it can be
You’re running late for your 8AM
When somebody tells you unexpectedly
‍
It was just a little change
Small to say the least
‍
The scripts failed late last night
The dashboards are not alright
‍
We’re all a little scared
None of us prepared
‍
Scheduling is a Beast!
🎵
Computers have had schedulers for a long time. Since the 1970s (Version 7 Unix), to be exact.
‍
But a lot has changed since then.
‍
Today, there are many options. And as with all things in life... they come with some tradeoffs.
‍
In this post, we'll cover two of the most common solutions (cron and Airflow) and why we think Ludis is the better option.
Because cron has been around for so long, every major operating system has some form of simple task scheduler.
‍
Windows has its own tool called Task Scheduler, and both Linux and macOS natively support cron (accessible through the terminal with crontab -e).
‍
No need for additional dependencies; it comes pre-installed on most operating systems. It runs jobs as scheduled with minimal overhead.
Jobs are defined in a crontab file using straightforward syntax (see the example below). There is a bit of research you have to do upfront, but after that, you can keep following the pattern.
Good for automating recurring scripts like backups, log rotations, or notifications.
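For instance, here is a hypothetical crontab with two entries. The five time fields are standard cron syntax; the script paths and names are ours, purely for illustration:

```
# minute  hour  day-of-month  month  day-of-week  command

# Run a backup script at 2:30 AM every day, appending output to a log
30 2 * * * /home/sarah/scripts/backup.sh >> /home/sarah/logs/backup.log 2>&1

# Rotate logs at midnight every Monday
0 0 * * 1 /home/sarah/scripts/rotate_logs.sh
```

Edit these lines with crontab -e, save, and cron picks them up automatically.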
‍
No built-in way to track job execution, failures, or retries.
‍
Because everything runs on a single machine, if two jobs are supposed to run at the same time, they can compete with each other for resources and, in the worst case, crash the whole machine.
‍
If the machine isn’t on, the script doesn't run.
Cannot define task dependencies. Jobs run independently without checking what else is running.
‍
Doesn’t support dynamic scheduling (e.g., based on conditions or data availability).
Cron can only run on a single computer at a time!
‍
If anyone with access updates a library version, they could accidentally break all the other jobs running on the computer. If you spin up 2 computers with cron, you have to manually keep all the packages in sync.
‍
This becomes very hard to manage when there are multiple developers and/or lots of jobs. You need a more robust system to manage multiple jobs across a distributed system.
Doesn’t play well when files change. Scripts have to be manually tracked on each individual computer, and if something happens to that specific file or machine, your job won’t run.
Airflow is an open source workflow orchestration tool designed for managing complex, interdependent tasks. It uses the concept of a Directed Acyclic Graph (DAG) to define the logic of a workflow.
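As a rough sketch of what that looks like in practice (assuming Airflow 2.x; the DAG id, schedule, and task bodies here are hypothetical), a two-task DAG with a dependency reads like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling data...")


def report():
    print("refreshing the dashboard...")


with DAG(
    dag_id="nightly_dashboard_refresh",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="30 2 * * *",  # same syntax as cron: 2:30 AM daily
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    report_task = PythonOperator(task_id="report", python_callable=report)

    extract_task >> report_task  # report only runs after extract succeeds
```

The >> operator is what encodes the dependency: report will not start until extract has finished successfully.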
Supports Directed Acyclic Graphs (DAGs) to define dependencies between tasks.
‍
Can run across multiple nodes using Celery, Kubernetes, or a database-backed executor.
Provides a UI for job monitoring, logging, and retrying failed tasks. The UI has some role-based access control, but the control stops at the UI. Creating and deploying new jobs still requires access to the server.
Tasks can be scheduled based on conditions, data availability, or external triggers.
Supports plugins, integrations (e.g., AWS, GCP, databases), and custom operators
Requires a good amount of time to set up properly. There are some tools to help you get started, but you need a database (e.g., PostgreSQL, MySQL), a scheduler, and an executor. Then you need to make sure it's all actually up and running.
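To give a feel for the moving parts, here is a rough sketch of standing up a local Airflow 2.x instance (the username is ours, and a production deployment would swap the default SQLite database for PostgreSQL or MySQL):

```
pip install apache-airflow

airflow db init        # create the metadata database (SQLite by default)

airflow users create \
  --username admin --role Admin \
  --firstname Sarah --lastname Dev --email sarah@example.com

airflow scheduler               # must stay running to parse and launch DAGs
airflow webserver --port 8080   # the monitoring UI, in a separate process
```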
Also, Airflow natively only supports Python. So, if you want to use R, or any other programming language, you will spend a lot of time yak shaving.
‍
Here are just some examples of why your DAG might not even show up: the file isn’t in the scheduler’s dags folder, it has a syntax or import error, or the scheduler simply hasn’t re-parsed the folder yet.

Requires knowledge of Python and workflow orchestration concepts. In order to schedule a job, you have to write a separate DAG definition.
‍
Using Airflow for basic scheduling (like a simple backup script) can become very tedious.
Runnable scripts have to live in specific directories, DAGs have to be defined in a specific way, and then you have to go to a separate UI to trigger the DAG.
‍
Most of the time you just want to trigger a script without jumping through all these hoops.
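For comparison, assuming the hypothetical DAG sketched earlier, even a one-off manual run goes through Airflow instead of your script (the script name below is made up):

```
# Trigger the DAG through Airflow's CLI (or click through the UI)...
airflow dags trigger nightly_dashboard_refresh

# ...versus what you usually actually want:
python refresh_dashboard.py
```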
Doesn’t play well when files change. Files need to be manually tracked on an individual computer, and if something happens to that specific file or computer, your job won’t run. This is even harder than with cron because Airflow can be deployed as a distributed system.
Airflow is built by engineers, for engineers… including looking like it was built by a bunch of backend engineers. Read how we improved on the UI here: Simplifying Workflow Management: A Better Way to Use Airflow.
‍
Ludis Workflows are built on top of Airflow, so we support the same Task Dependency Model, Scalability, Monitoring, and Extensibility features of Airflow. But we also make it a lot easier to use!
No matter how many people are working on a script, Ludis allows you to define access rules for each of the different collaborators. From Sarah, your script developer, to James, who just needs “view only” privileges to know if the script ran or not.
Every customer gets their own private cluster. This is hosted in your country to comply with any data privacy regulations.
‍
Security is built-in. Environment variables and secrets are encrypted by default so you can store your DB passwords with peace of mind. Small teams often don’t have the time to become security experts, so our goal is to make the most secure choice the easiest option.
‍
With cron and native Airflow, secrets are not encrypted by default, and are often just hardcoded. This can expose your team to a lot of unnecessary risk down the line.
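As a minimal sketch of the safer pattern (the variable name here is hypothetical), read the password from an environment variable instead of pasting it into the script:

```python
import os

# Read the database password from the environment instead of
# hardcoding it in the script (and checking it into version control).
db_password = os.environ["DB_PASSWORD"]  # raises KeyError if unset
```

The secret still has to be stored and encrypted somewhere; that part is what Ludis handles for you by default.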
Ludis is a software as a service (SaaS) platform built in the cloud. This means we get to figure out all the infrastructure setup for you.
‍
Workflows just work out of the box. By using a cloud SaaS product, you don’t have to hunt down arcane database commands just to run a simple script. Cron and native Airflow require a decent amount of code-based configuration to run any script.
‍
We’ve spent a lot of time making Workflows easy to configure. Ludis lets you define workflows using a drag-and-drop UI. Even though we’ve built our workflows on top of Airflow… *This is something Airflow cannot do.*
‍
Trust us, you won’t miss the headaches of debugging why your DAG doesn’t even show up. But if you are an Airflow master, Ludis also supports writing your own DAG definition files.
‍
By default, everything in Ludis is backed up to a customer-owned GitHub repository. When you deploy (or redeploy) a workflow, we make sure to put the files in the right place so everything runs smoothly.
Ludis has a lot of pros… but it does require a paid subscription.
‍
We spin up dedicated private clusters for each customer, and this costs money.
That was a lot of information, so there’s really only one question left: which one should you choose?

Cron is a good fit if your use-cases are not time sensitive. If the machine has any issues, your scripts may not run and you won’t get any notifications.
‍
Cron becomes painful as more people use it. Schedules can easily overlap and there is no security once someone has access to the scripts folder.
So what about Airflow? Technically, Airflow is free.
‍
Unfortunately there is no such thing as a free lunch.
‍
Given the amount of engineering time it takes to get Airflow working, you will need a decent budget just for the R&D to test whether this is the right solution for you. Your data scientists and analysts will also need to be fairly technical, because they will be manually changing system files.
Ludis is the right fit if collaboration is an important part of your team dynamic.
‍
Your team members share code with each other or work on different pieces of the same projects. Or you have consultants who create code that your team may not know how to run.
‍
Your scripts *need* to run when you scheduled them to run. If there are issues, you need to know ASAP.
‍
You are writing more than just scheduled jobs in isolation. Most of the time, these scripts are powering dashboards. Ludis is especially useful if you want to deploy dashboards and keep them up to date.
‍
If you don’t want your data scientists wasting hundreds of hours on data infrastructure, Ludis is probably right for you. Pricing depends on a variety of factors, but will be less expensive than having a dedicated data infrastructure team.
‍
Reach out to info@ludisanalytics.com to get a discounted trial account and see if this is the right fit for you!
‍