[HN Gopher] Replacing cron jobs with a centralized task scheduler
___________________________________________________________________
Replacing cron jobs with a centralized task scheduler
Author : tlf
Score : 146 points
Date : 2025-07-28 18:23 UTC (4 days ago)
(HTM) web link (mayhul.com)
(TXT) w3m dump (mayhul.com)
| _wire_ wrote:
| Next thing you know you'll have systemd.
| datadrivenangel wrote:
| Or worse, airflow!
| UltraSane wrote:
| Airflow can be frustrating but when it works it is so
| satisfying.
| globular-toast wrote:
| I think mistaking Airflow for a mere "task scheduler" is
| part of that frustration.
| flakes wrote:
| After using Argo Workflows, I don't think I will ever
| return to Airflow. Kubernetes is not an easy system to
| manage, but managing an Airflow setup is somehow worse. The
| story around disaster recovery and scheduler redundancy was
| an absolute nightmare for me.
| datadrivenangel wrote:
| Argo workflows is much more painful for data processing
| than Airflow in my experience.
| flakes wrote:
| It's a tradeoff. Ease of modeling the pipelines vs ease
| of managing the infrastructure. I'm not really a fan of
| either syntax for defining DAGs, but they're the best
| options out there imo.
| d00mB0t wrote:
| You forgot D-Bus.
| emchammer wrote:
| [flagged]
| dang wrote:
| "_Please don't post shallow dismissals, especially of other
| people's work. A good critical comment teaches us something._"
|
| https://news.ycombinator.com/newsguidelines.html
| gnat wrote:
| I find the best comments here to be ones where people use their
| knowledge and experience to discuss the relative strengths and
| weaknesses of the technology in the post. I see a bunch of short
| single-sentence comments here that add no value.
|
| For my part, I see this pattern repeatedly at different places.
| The raw tools in the platforms are too codey and the third-party
| frameworks like Temporal seem overkill, so you build a scheduler
| and need to solve the problems OP did: only run once, know if it
| errored, etc.
|
| But it's amazing how "it's firing off a basic action!" becomes a
| script, then becomes a script composed of reusable actions that
| can pick up where they left off in case of errors ... Over time
| your "it's just enough for us!" feature creeps towards the
| framework's functionality.
|
| I'd be curious to know how long the OP's solution stays simple
| before it submits to the feature creep demands. (Long may
| complexity be fought off, though! Every day you can live without
| the complexity of full workflows is a blessing)
| ants_everywhere wrote:
| Cloud companies also provide globe-scale cronjobs that work a
| lot like a Unix cronjob. Arguably less mental overhead than
| adopting a separate framework.
|
| And such a service provides reliability guarantees.
|
| If I have to do a reliable periodic service, my go-to is a
| kubernetes cronjob, which is like a baby version of a cloud
| cronjob. I'd be reluctant to adopt some sort of task queue
| framework because of the complexity of the mental model plus
| the complexity of keeping one more thing running reliably. K8s
| is already running reliably, I might as well use that.
| aloha2436 wrote:
| Maybe I'm just lucky to work at a place with good tools, but in
| my experience Temporal isn't super heavyweight to use compared
| to building your own even-very-simple scheduler.
|
| And it's worth it because now you have Temporal, which is the
| bee's knees as far as I'm concerned. I will gladly sing praises
| of any tool that saves me getting paged, and Temporal has that
| in spades.
| booi wrote:
| second temporal. plus it gives you more freedom to write jobs
| in different languages... not that you would or should in
| most cases but there's definitely good reasons
| cyberpunk wrote:
| Don't do it onprem unless you want to spend six figures
| monthly on cassandra database nodes for pretty shit
| performance and face constant saas upselling and then
| discover how hard it is to migrate off of.
|
| Write your own scheduler.
|
| Oracle is cheaper in the long run.
| vasco wrote:
| The pragmatic answer is Jenkins. Always has been.
| flakes wrote:
| Jenkins is a place where you can be safe for a long time,
| however, it starts to break down at scale. I see it time
| after time for these batch workflow jobs. At the start, jobs
| run in seconds and everyone is happy.
|
| Over time, jobs start taking long enough to the point where
| you need to split them. Separate jobs are assigned slices of
| the original batch. Eventually, there are so many slices that
| you make a Jenkins job where the sole responsibility is
| firing off these individual jobs.
|
| Then you start hitting the real painpoints in Jenkins. Poor
| allocation of jobs across your nodes/agents, often
| overloading CPU/Mem on machines, and you struggle to manage
| the ungodly interface that is the Jenkins REST endpoint. You
| install many Jenkins addons to try and address the scheduling
| problems, and end up with a team dedicated to managing this
| Jenkins infrastructure.
|
| The scaling struggles continue to amass and you end up
| needing separate Jenkins instances to battle the load. Any
| attempt at replacing the Jenkins infrastructure goes on
| standstill, as the amount of random scripts found in
| Jenkinsfiles has created an insurmountable vendor lock-in.
|
| You read a post about a select-for-update job scheduler and
| reflect on simpler times. You cry as you refactor your
| Jenkins Groovy DSL.
| dijit wrote:
| it's actually much more common than you think for people to
| reuse CI systems for cron tasking.
|
| It's always a mistake, but it's easy in the moment and
| sticks around longer than I'd like.
| theshrike79 wrote:
| CI systems like Jenkins are there and they're corp-
| approved.
|
| Getting a weird 3rd party scheduling system with access
| to internal stuff approved is HARD in big corps.
|
| So we (ab)use the CI system we have. It has scheduling
| and it already accesses internal resources.
| tunesmith wrote:
| What's the thing you should replace Jenkins with at scale?
| msgodel wrote:
| Jenkins is terrible for just about everything. Cron has real
| problems but at least you can version control the crontab.
| Jenkins is fat, hard to work with since you'll just have one
| shared instance, and everything is buried in special objects
| hidden behind a very unergonomic and undiscoverable web GUI.
| meatmanek wrote:
| Why use a 1 minute cron job to run the tasks, instead of a
| continuously-running queue worker (or several)?
| o11c wrote:
| Back in the day, the reason I had 1-minute cron jobs (with
| flock of course) was because "what if the bespoke daemon gets
| killed somehow?" We also used screen/tmux a lot, but only for
| stuff that could afford to wait until somebody poked it (often,
| because if it repeatedly crashed the cause was likely novel and
| would need investigation).
|
| Systemd has been a _game-changer_ for small-scale deployments.
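The flock guard mentioned for the 1-minute cron jobs can be sketched in Python. This is a minimal illustration assuming a Linux host; the lock path and the function names `try_lock` and `run_once` are hypothetical. A job started every minute exits immediately when the previous run still holds the lock.

```python
import fcntl
import os
import tempfile

# Hypothetical lock path; any file writable by the job would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "example-job.lock")


def try_lock(path=LOCK_PATH):
    """Return an open file holding an exclusive lock, or None if busy."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_once(job, path=LOCK_PATH):
    """Run job() unless another invocation holds the lock."""
    handle = try_lock(path)
    if handle is None:
        return False  # previous run still in progress; skip this tick
    try:
        job()
        return True
    finally:
        handle.close()  # closing the file releases the lock
```

The kernel drops the lock when the holding process exits, so a crashed run does not leave a stale lock behind.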
| entropie wrote:
| > Systemd has been a game-changer for small-scale
| deployments.
|
| The deep integration into nixos made me feel the same. You
| sound like you could enjoy a bit of Nix too.
| o11c wrote:
| I dabbled a little with Nixos a while back (e.g. I think I
| reported the bug that broke the entire point of
| /etc/os-release for chroots, as well as commented on how to do
| a container install from scratch at a point when nobody
| documented it), but there were 3 things that really pushed
| me away:
|
| 1. Nix has clear advantages for *deployment* (including
| end-user deployment) but really gets in the way for new
| *development*. Maybe flakes fix this? Maybe not though.
|
| 2. The "Nix on other Linux" install scripts were hostile in
| attacking startup scripts, rather than allowing opt-in
| isolation.
|
| 3. The Nix language (and library?) is not sane. Nobody
| actually understands it, only copy-pastes pieces of existing
| package scripts and hopes the changes work.
| MadnessASAP wrote:
| > 3. The Nix language (and library?) is not sane. Nobody
| actually understands it, only copy-pastes pieces of
| existing package scripts and hopes the changes work.
|
| Perhaps Nix is "Wonko the Sane" and it is in fact the
| rest of us who are in the asylum?
|
| Nix, the language, is a little strange at first but
| really does make sense. Nixpkgs, the "standard library",
| is a little stranger and sometimes makes an odd default
| choice. The nice thing though is that using Nix you can
| coerce Nixpkgs into just about any shape that suits you.
| anitil wrote:
| > Systemd has been a game-changer for small-scale
| deployments.
|
| Why is this? My only memory of systemd was slightly better
| configurations for sequencing the start of processes that
| depended on the completion of earlier processes so I'm a bit
| rusty.
| SteveNuts wrote:
| Systemd has timers now which have way better error
| handling.
| pjmlp wrote:
| Which is kind of ironic, given that systemd basically brings to
| Linux the system services management found in other UNIXes,
| Windows, mainframes and micros, but still gets plenty of
| hate.
| JdeBP wrote:
| It's folk wisdom, generated by a long line of people who did
| not have proper daemon management despite such tooling having
| been available since the 1990s. Any sort of service management,
| from running things once at bootstrap to having a long-running
| service, becomes hammered into the shape of a cron job.
|
| There are _loads_ of people over the years who have reached for
| cron instead of reaching for proper general-purpose daemon
| management (SRC, SMF, daemontools, runit, daemontools-encore,
| perp, s6, ...). It is on Stack Exchange answers and in people's
| personal "How I did this" articles on WWW sites. (Although
| the idea goes back to the Usenet era.) It became one of those
| practices perpetuated because other people did it.
|
| The next step is always discovering that cron's error handling
| and logging are aimed at an era when the system operator sat in
| the console room, and received "You have new mail"
| notifications at the console shell prompt.
|
| And the step after that is (re-)discovering that the anacron
| approach does not fully cut the mustard. (-:
| slightwinder wrote:
| Single scripts are easier to code and can be written more loosely,
| as you don't have to look out for sneaky memory leaks and other
| problems which might emerge in long-running tasks. There is
| also no need to build and maintain a bespoke framework for
| managing your multiple jobs. This avoids mental debt for the
| devs. If you have many jobs, from multiple devs, it's the more
| pragmatic solution.
| shawn_w wrote:
| Isn't a "centralized task scheduler" pretty much what cron is?
| somat wrote:
| I was going to guess the author needed something that unified
| the task scheduling across a distributed system of computers.
| But that requirement is never mentioned in the article. And
| they still use cron to call their new scheduler... So unless I
| am missing something they did not replace cron at all, they
| just rewrote their scheduled jobs to use a common library and
| have more robust error handling.
| UltraSane wrote:
| centralized for many computers.
| eschaton wrote:
| It's not even a centralized task scheduler on its native UNIX:
| it's a centralized *userspace* task scheduler.
|
| Mainframe and minicomputer operating systems support scheduling
| in the operating system itself, as part of their process/thread
| scheduler; their native queuing systems are built on top of the
| primitives their scheduler offers, for proper accounting and
| maximum resource utilization (including prioritization).
|
| Only UNIX would just provide a way to run processes at a
| specified time or interval and call the job done.
| JdeBP wrote:
| Although you're right that Unix never really reached having
| the full three-level scheduling mechanisms of the mainframe
| operating systems, cron is not the actual Unix parallel of
| the high-level scheduler that keeps the running jobs list
| fed.
|
| That is in fact batch (and atrun, although that's considered
| an implementation detail).
|
| * https://pubs.opengroup.org/onlinepubs/9799919799/utilities/
| b...
|
| Most implementations flesh out the "implementation-defined
| algorithms" stuff to be calculations based upon load
| averages, as on NetBSD.
|
| * https://man.netbsd.org/batch.1
|
| * https://man.netbsd.org/atrun.8
|
| Or fairly primitive parallelism limits as on Illumos.
|
| * https://illumos.org/man/1/batch
|
| * https://illumos.org/man/5/queuedefs
|
| Not quite JECL, is it? (-:
| ptx wrote:
| It's lacking a convenient way to queue a task and inspect the
| task queue, but "at" (at/atq/atrm) provides exactly the "single
| cron job responsible for executing scheduled tasks that runs
| once every minute" that the author was looking for.
| burnt-resistor wrote:
| Jobs that need retries, atomicity, monitoring, rescheduling, ad
| hoc scheduling, and flexibility probably aren't suited to most
| cron servers.
|
| Beanstalkd, cronicle, agenda, sidekiq, faktory, celery, etc. are
| the usual suspects.
|
| What is often missing is HA of the controller service process.
| sfortis wrote:
| Cronicle is a lifesaver. HA, clustering, API, clean UI, it's
| doing everything right. I'm using this also as an API wrapper
| for Bash and Python scripts.
|
| https://github.com/jhuckaby/Cronicle/blob/master/docs/Setup....
| mrweasel wrote:
| I'd probably even add systemd timers to that list. It does most
| of what you list, minus the retries (but I think you could
| handle that in the service definition)
| sontek wrote:
| I love this solution, I've implemented a very similar task
| scheduler at many companies.
|
| I do think the _best_ solution for this is still RabbitMQ. Its
| "Delayed Messages" feature lets you push a task onto the queue
| and tell it to run at a very specific time, and then it just
| processes it at that time.
| pokstad wrote:
| Temporal.io is made for this
| halamadrid wrote:
| Unmeshed.io is another alternative. You don't even need to
| write code for your schedules
| theshrike79 wrote:
| Paying $500 a month for cron just seems wrong.
|
| And adds an external dependency for something very essential.
| pokstad wrote:
| You can run it yourself for free
| pjmlp wrote:
| If they are using AWS, why not use what AWS already has, battle
| tested for task scheduling functions?
| nevon wrote:
| I've built something similar as a service to be used by
| developers at a large-ish enterprise. Granted, it was based on
| functionality offered by AWS, but the users didn't really know
| that.
|
| The reason we built it, despite the fact that developers could
| very well have deployed a CloudWatch EventBridge schedule + SQS
| + lambda or similar, is because they never did. They would
| consistently choose to build it into their existing services,
| which were rarely if ever handling things like limiting
| concurrency if a task took too long, emitting metrics on
| success/failure/duration, audit logging for when a task had to
| be manually triggered for some reason. If I had to guess, I
| think the reason was because it allowed them to piggyback on
| existing change controls and "just write application code"
| instead of having to think about additional pieces of
| infrastructure.
|
| If I could do it again, I would probably have reached for
| something like Temporal, even though it seemed overkill for
| what we initially set out to do. It took about a week before
| people started asking for locking and retries.
| guappa wrote:
| So that they can drop AWS
| pjmlp wrote:
| It is a bit hard when they rely on AWS message queues for the
| implementation.
| mrweasel wrote:
| If you're running on AWS and not designing a system that
| locks you in to the AWS platform, then you're going to be
| overpaying by a lot.
| UltraSane wrote:
| The Windows Task Scheduler is actually very nice and powerful.
| One cool trick is to have a task triggered by a windows event.
| jiggunjer wrote:
| Aka workflow orchestrator, pipeline manager, process runner,
| automation tool.
|
| It's not clear if they used a product or DIY solution. The nice
| thing many existing products offer is a web UI and a database.
| Felk wrote:
| I see that the author took a 'heuristical' approach for retrying
| tasks (having a predetermined amount of time a task is expected
| to take, and consider it failed if it wasn't updated in time) and
| uses SQS. If the solution is homemade anyway, I can only
| recommend leveraging your database's transactionality for this,
| which is a common pattern I have often seen recommended and also
| successfully used myself:
|
| - At processing start, update the schedule entry to 'executing',
| then open a new transaction and lock it, while skipping
| already locked tasks (`SELECT FOR UPDATE ... SKIP LOCKED`).
|
| - At the end of processing, set it to 'COMPLETED' and commit.
| This also releases the lock.
|
| This has the following nice characteristics:
|
| - You can have parallel processors polling tasks directly from
| the database without another queueing mechanism like SQS, and
| have no risk of them picking the same task.
|
| - If you find an unlocked task in 'executing', you know the
| processor died for sure. No heuristic needed
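The steps above can be sketched in Python. This is a minimal illustration, not the commenter's code: it assumes a PostgreSQL DB-API connection (for example from psycopg, with its `%s` parameter style) and a hypothetical table `scheduled_tasks(id, status, run_at)`. The name `process_one` and the SQL constants are also made up for the example.

```python
# Assumes a PostgreSQL DB-API connection and a hypothetical table
# scheduled_tasks(id, status, run_at).

CLAIM_SQL = """
UPDATE scheduled_tasks SET status = 'executing'
WHERE id = (
    SELECT id FROM scheduled_tasks
    WHERE status = 'pending' AND run_at <= now()
    ORDER BY run_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""

LOCK_SQL = """
SELECT id FROM scheduled_tasks
WHERE id = %s AND status = 'executing'
FOR UPDATE SKIP LOCKED
"""

DONE_SQL = "UPDATE scheduled_tasks SET status = 'COMPLETED' WHERE id = %s"


def process_one(conn, handler):
    """Claim one due task and run handler(task_id) while holding its row lock."""
    cur = conn.cursor()
    # Transaction 1: mark a due task as 'executing' and commit, so the
    # new state is visible to other processors.
    cur.execute(CLAIM_SQL)
    row = cur.fetchone()
    conn.commit()
    if row is None:
        return None  # nothing is due
    task_id = row[0]
    # Transaction 2: take the row lock and hold it for the whole run.
    cur.execute(LOCK_SQL, (task_id,))
    if cur.fetchone() is None:
        conn.rollback()
        return None  # another processor already holds the lock
    handler(task_id)
    cur.execute(DONE_SQL, (task_id,))
    conn.commit()  # committing also releases the row lock
    return task_id
```

If the processor dies during `handler`, its connection drops and the lock is released while the row stays in 'executing'. A recovery job could find such rows with `SELECT ... WHERE status = 'executing' FOR UPDATE SKIP LOCKED`. One caveat of this particular sketch: there is a short window between the two transactions in which a freshly claimed row is also unlocked.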
| alex5207 wrote:
| This is exactly what we're doing. Works like a charm.
| diarrhea wrote:
| This introduces long-running transactions, which at least in
| Postgres should be avoided.
| danielheath wrote:
| Depends what else you're running on it; it's a little
| expensive, but not prohibitively so.
| maxbond wrote:
| Long running transactions interfere with vacuuming and
| increase contention for locks. Everything depends on your
| workload but a long-running transaction holding an
| important lock is an easy way to bring down production.
| iglio wrote:
| If the system is already using SQS, DynamoDB has this
| locking library which is lighter weight for this use case
|
| https://github.com/awslabs/amazon-dynamodb-lock-client
|
| > The AmazonDynamoDBLockClient is a general purpose
| distributed locking library built on top of DynamoDB. It
| supports both coarse-grained and fine-grained locking.
| aitchnyu wrote:
| I read too many "use Postgres as your queue (pgkitchensink is
| in beta)", now I'm learning listen/notify is a strain, and so
| are long transactions. Is there a happy medium?
| bmn__ wrote:
| Just stop worrying and use it. If and when you actually
| bump into the limitations, then it's time to sit down and
| think and find a supplement or replacement for the
| offending part.
| rco8786 wrote:
| Excellent advice across many domains/techs here.
| thewisenerd wrote:
| t1: select for update where status=pending, set
| status=processing
|
| t2: update, set status=completed|error
|
| these are two independent, very short transactions? or am i
| misunderstanding something here?
|
| --
|
| edit:
|
| i think i'm not seeing what the 'transaction at start of
| processor' logic is; i'm thinking more of a polling logic
|     while true:
|         r := select for update
|         if r is None: return
|         sleep a bit
|
| this obviously has the drawback of knowing how long to sleep
| for; and tasks not getting "instantly" picked up, but eh,
| tradeoffs.
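The two-short-transactions variant can be sketched with Python's built-in sqlite3 module. SQLite stands in for the thread's Postgres here and has no `SELECT FOR UPDATE`, so the claim relies on a status check in the `UPDATE` instead. The `tasks(id, status)` table and the function names are assumptions for illustration.

```python
import sqlite3


def claim_next(conn):
    """t1: move one pending task to 'processing' and return its id."""
    with conn:  # one short transaction
        row = conn.execute(
            "SELECT id FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in WHERE guards against races: only one
        # worker's UPDATE can change the row.
        changed = conn.execute(
            "UPDATE tasks SET status = 'processing' "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        ).rowcount
        return row[0] if changed == 1 else None


def finish(conn, task_id, ok):
    """t2: record the outcome in a second short transaction."""
    with conn:
        conn.execute(
            "UPDATE tasks SET status = ? WHERE id = ?",
            ("completed" if ok else "error", task_id),
        )


def drain(conn, handler):
    """Run every pending task once; a poller would sleep, then call again."""
    while (task_id := claim_next(conn)) is not None:
        try:
            handler(task_id)
            finish(conn, task_id, True)
        except Exception:
            finish(conn, task_id, False)
```

The tradeoff is that a worker that crashes between t1 and t2 leaves its row in 'processing' with nothing to signal the crash, so some timeout heuristic is needed again.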
| maxbond wrote:
| They're proposing doing it in one transaction as a
| heartbeat.
|
| > - If you find an unlocked task in 'executing', you know
| the processor died for sure. No heuristic needed
| dthedavid wrote:
| Great work. Did you consider buying instead of building? I've
| worked at organizations that built similar systems, but what was
| often lacking was developer experience, observability, and
| scalability, basically everything outside of core functionality;
| essentially the stuff that you're trying to tack on as you
| improve your system.
|
| Now that I'm building on my own, I've thought about building as
| well, but I've found that off-the-shelf systems handle all of
| this far better (and they are opensourced too), ie trigger-dot-
| dev and many others.
| dmitry-vsl wrote:
| > We had createScheduledPosts.ts that would run every 15 minutes,
| scan our table of scheduled posts and create any that needed to
| be published.
|
| Why not set the publication_date when you create a post and have
| a function getPublishedPosts that fetches a list of posts,
| filtering out those with a publication_date later than the
| current date? With this approach, you don't need cron jobs at
| all.
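A minimal sketch of that read-time filter, assuming the intent is to hide posts whose publication_date is still in the future. Posts are modelled as plain dicts with timezone-aware datetimes, and the Python name `get_published_posts` mirrors the hypothetical getPublishedPosts from the comment.

```python
from datetime import datetime, timezone


def get_published_posts(posts, now=None):
    """Return only the posts whose publication_date has already passed."""
    now = now or datetime.now(timezone.utc)
    return [post for post in posts if post["publication_date"] <= now]
```

In a real application the same condition would more likely live in the query's WHERE clause than in application code.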
| jon-wood wrote:
| Maybe there's a bunch of other actions that need to take place
| when a post is published, such as sending notification emails,
| or posting stuff to social media. They could of course be
| scheduled jobs in their own right, but you haven't really saved
| yourself any effort there, and now if the publishing time
| changes you've got to reschedule all those individual jobs.
| shireboy wrote:
| One gotcha with roll your own task scheduler is if you want to
| run it across multiple machines. If you need 5 machines running
| different scheduled tasks, you need a locking mechanism to ensure
| only one machine is processing the task. In the author's approach
| this is handled by the queue, but in my read the scheduler can
| only happen on one machine or you get multiple of the same task
| in the queue. Retry can get more complicated: depending on the
| failure you may want an exponential backoff, retrying N times and
| waiting longer durations between. A nice dashboard to see the
| status of everything is helpful also.
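The exponential backoff described here can be sketched as follows. The function names, the doubling factor and the 60-second cap are illustrative assumptions, not taken from the article or from any particular library.

```python
import time


def backoff_delays(retries, base=1.0, factor=2.0, cap=60.0):
    """Seconds to wait before each retry: base * factor**n, capped."""
    return [min(cap, base * factor ** n) for n in range(retries)]


def run_with_retries(task, retries=3, sleep=time.sleep, **backoff):
    """Call task(); on failure wait and retry up to `retries` more times."""
    delays = backoff_delays(retries, **backoff)
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
```

Passing `sleep` as a parameter keeps the helper testable. A production version would probably add random jitter so that many failing tasks do not all retry at the same instant.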
|
| In .NET world I use Hangfire for this. In Node (I assume what
| this is) I tinkered with Bull, but not sure what best in class is
| there.
| cyberpunk wrote:
| Oban enters the chat... :)
| sunshine-o wrote:
| Is there a cool lightweight alternative to cron for (at least) a
| single host?
|
| To illustrate what I am looking for, I often end up using
| supervisord [0] (but I also like immortal [1]) for process
| control when not on a systemd enabled system. In my experience
| they are reliable, lightweight and a pleasure to work with.
|
| I am looking for something similar for scheduled jobs.
|
| - [0] https://supervisord.org/
|
| - [1] https://immortal.run/
| isp wrote:
| Supercronic: https://github.com/aptible/supercronic
|
| Designed to run in a container, but should equally well work on
| a single host. However, no option for "high availability"
| running, where multiple hosts coordinate.
| justusthane wrote:
| Take a look at this comment for some options:
| https://news.ycombinator.com/item?id=44752548
| majkinetor wrote:
| I find Rundeck is great for this. I've been using it with hundreds
| of jobs for a decade, with a bunch of users accessing it and
| checking logs, with retries, notifications and all the enterprise
| thingies for free. It also provides an easy way to put a GUI on
| scripts.
| pinko wrote:
| HTCondor is always an option. Lacks shiny tinfoil, but works like
| a tank.
| rashidae wrote:
| What happens when the DB gets large? How do you handle
| idempotency? (What if SQS delivers twice?) The cron job is still
| a single point of failure...
| qianli_cs wrote:
| Managing complex scheduled workflows at scale comes with a lot
| of nuances. This is exactly why we're building DBOS (shameless
| plug! https://github.com/dbos-inc), which provides durable cron
| jobs and exactly-once workflow triggering. Since it's just a
| library on top of Postgres, it doesn't require a centralized
| scheduler (well, think of Postgres as the coordinator).
|
| One challenge is to guarantee exactly-once processing across
| software upgrades. DBOS uses the cron-scheduled time as an
| idempotency key, and tags each workflow execution with a
| version. We also use the database transactions to guard against
| conflicting concurrent updates.
___________________________________________________________________
(page generated 2025-08-01 23:00 UTC)