[HN Gopher] Replacing cron jobs with a centralized task scheduler
       ___________________________________________________________________
        
       Replacing cron jobs with a centralized task scheduler
        
       Author : tlf
       Score  : 146 points
       Date   : 2025-07-28 18:23 UTC (4 days ago)
        
 (HTM) web link (mayhul.com)
 (TXT) w3m dump (mayhul.com)
        
       | _wire_ wrote:
       | Next thing you know you'll have systemd.
        
         | datadrivenangel wrote:
         | Or worse, airflow!
        
           | UltraSane wrote:
           | Airflow can be frustrating but when it works it is so
           | satisfying.
        
             | globular-toast wrote:
             | I think mistaking Airflow for a mere "task scheduler" is
             | part of that frustration.
        
             | flakes wrote:
             | After using Argo Workflows, I don't think I will ever
             | return to Airflow. Kubernetes is not an easy system to
             | manage, but managing an Airflow setup is somehow worse. The
             | story around disaster recovery and scheduler redundancy was
             | an absolute nightmare for me.
        
               | datadrivenangel wrote:
               | Argo workflows is much more painful for data processing
               | than Airflow in my experience.
        
               | flakes wrote:
               | It's a tradeoff. Ease of modeling the pipelines vs ease
               | of managing the infrastructure. I'm not really a fan of
               | either syntax for defining DAGs, but they're the best
               | options out there imo.
        
       | d00mB0t wrote:
       | You forgot D-Bus.
        
       | emchammer wrote:
       | [flagged]
        
         | dang wrote:
         | "_Please don't post shallow dismissals, especially of other
         | people's work. A good critical comment teaches us something._"
         | 
         | https://news.ycombinator.com/newsguidelines.html
        
       | gnat wrote:
       | I find the best comments here to be ones where people use their
       | knowledge and experience to discuss the relative strengths and
       | weaknesses of the technology in the post. I see a bunch of short
       | single-sentence comments here that add no value.
       | 
       | For my part, I see this pattern repeatedly at different places.
       | The raw tools in the platforms are too codey and the third-party
       | frameworks like Temporal seem overkill, so you build a scheduler
       | and need to solve the problems OP did: only run once, know if it
       | errored, etc.
       | 
       | But it's amazing how "it's firing off a basic action!" becomes a
       | script, then becomes a script composed of reusable actions that
       | can pick up where they left off in case of errors ... Over time
       | your "it's just enough for us!" feature creeps towards the
       | framework's functionality.
       | 
       | I'd be curious to know how long the OP's solution stays simple
       | before it submits to the feature creep demands. (Long may
       | complexity be fought off, though! Every day you can live without
       | the complexity of full workflows is a blessing)
        
         | ants_everywhere wrote:
         | Cloud companies also provide globe-scale cronjobs that work a
         | lot like a Unix cronjob. Arguably less mental overhead than
         | adopting a separate framework.
         | 
         | And such a service provides reliability guarantees.
         | 
         | If I have to do a reliable periodic service, my go-to is a
         | kubernetes cronjob, which is like a baby version of a cloud
         | cronjob. I'd be reluctant to adopt some sort of task queue
         | framework because of the complexity of the mental model plus
         | the complexity of keeping one more thing running reliably. K8s
         | is already running reliably, I might as well use that.
        
         | aloha2436 wrote:
         | Maybe I'm just lucky to work at a place with good tools, but in
         | my experience Temporal isn't super heavyweight to use compared
         | to building your own even-very-simple scheduler.
         | 
         | And it's worth it because now you have Temporal, which is the
         | bees knees as far as I'm concerned. I will gladly sing praises
         | of any tool that saves me getting paged, and Temporal has that
         | in spades.
        
           | booi wrote:
           | second temporal. plus it gives you more freedom to write jobs
           | in different languages... not that you would or should in
           | most cases but there's definitely good reasons
        
             | cyberpunk wrote:
             | Don't do it onprem unless you want to spend six figures
             | monthly on cassandra database nodes for pretty shit
             | performance and face constant saas upselling and then
             | discover how hard it is to migrate off of.
             | 
             | Write your own scheduler.
             | 
             | Oracle is cheaper in the long run.
        
         | vasco wrote:
         | The pragmatic answer is Jenkins. Always has been.
        
           | flakes wrote:
           | Jenkins is a place where you can be safe for a long time,
           | however, it starts to break down at scale. I see it time
           | after time for these batch workflow jobs. At the start, jobs
           | run in seconds and everyone is happy.
           | 
           | Over time, jobs start taking long enough to the point where
           | you need to split them. Separate jobs are assigned slices of
           | the original batch. Eventually, there are so many slices that
           | you make a Jenkins job where the sole responsibility is
           | firing off these individual jobs.
           | 
           | Then you start hitting the real painpoints in Jenkins. Poor
           | allocation of jobs across your nodes/agents, often
           | overloading CPU/Mem on machines, and you struggle to manage
           | the ungodly interface that is the Jenkins REST endpoint. You
           | install many Jenkins addons to try and address the scheduling
           | problems, and end up with a team dedicated to managing this
           | Jenkins infrastructure.
           | 
           | The scaling struggles continue to amass and you end up
           | needing separate Jenkins instances to battle the load. Any
           | attempt at replacing the Jenkins infrastructure goes on
           | standstill, as the amount of random scripts found in
           | Jenkinsfiles has created an insurmountable vendor lock-in.
           | 
           | You read a post about a select-for-update job scheduler and
           | reflect on simpler times. You cry as you refactor your
           | Jenkins Groovy DSL.
        
             | dijit wrote:
             | it's actually much more common than you think for people to
             | reuse CI systems for cron tasking.
             | 
             | It's always a mistake, but it's easy in the moment and
             | sticks around longer than I'd like.
        
               | theshrike79 wrote:
               | CI systems like Jenkins are there and they're corp-
               | approved.
               | 
               | Getting a weird 3rd party scheduling system with access
               | to internal stuff approved is HARD in big corps.
               | 
               | So we (ab)use the CI system we have. It has scheduling
               | and it already accesses internal resources.
        
             | tunesmith wrote:
             | What's the thing you should replace Jenkins with at scale?
        
           | msgodel wrote:
           | Jenkins is terrible for just about everything. Cron has real
           | problems but at least you can version control the crontab.
           | Jenkins is fat, hard to work with since you'll just have one
           | shared instance, and everything is buried in special objects
           | hidden behind a very unergonomic and undiscoverable web GUI.
        
       | meatmanek wrote:
       | Why use a 1 minute cron job to run the tasks, instead of a
       | continuously-running queue worker (or several)?
        
         | o11c wrote:
         | Back in the day, the reason I had 1-minute cron jobs (with
         | flock of course) was because "what if the bespoke daemon gets
         | killed somehow?" We also used screen/tmux a lot, but only for
         | stuff that could afford to wait until somebody poked it (often,
         | because if it repeatedly crashed the cause was likely novel and
         | would need investigation).
         | 
         | Systemd has been a _game-changer_ for small-scale deployments.
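
A minimal Python sketch of the "1-minute cron job with flock" idea mentioned above. It is an illustration under assumptions, not o11c's actual setup: the lock path and the names `try_lock` and `run_exclusive` are hypothetical, and `fcntl.flock` is only available on Unix-like systems. If the previous run is still holding the lock, the new run exits instead of piling up.

```python
import fcntl
import os


def try_lock(path):
    """Return an open fd holding an exclusive lock on path, or None if it is taken."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the holder.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def run_exclusive(path, job):
    """Run job() only if no other run holds the lock; return True if it ran."""
    fd = try_lock(path)
    if fd is None:
        return False  # a previous run is still going; skip this tick
    try:
        job()
        return True
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

A crontab entry firing every minute could then call a script that wraps its work in `run_exclusive("/tmp/myjob.lock", main)`, where the path and `main` are placeholders.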
        
           | entropie wrote:
           | > Systemd has been a game-changer for small-scale
           | deployments.
           | 
           | The deep integration into nixos made me feel the same. You
           | sound like you could enjoy a bit nix too.
        
             | o11c wrote:
             | I dabbled a little with Nixos a while back (e.g. I think I
             | reported the bug that broke the entire point of /etc/os-
             | release for chroots, as well as commented on how to do a
             | container install from scratch at a point when nobody
             | documented it), but there were 3 things that really pushed
             | me away:
             | 
             | 1. Nix has clear advantages for *deployment* (including
             | end-user deployment) but really gets in the way for new
             | *development*. Maybe flakes fix this? Maybe not though.
             | 
             | 2. The "Nix on other Linux" install scripts were hostile in
             | attacking startup scripts, rather than allowing opt-in
             | isolation.
             | 
             | 3. The Nix language (and library?) is not sane. Nobody
             | actually understands it, only copy-pastes pieces of existing
             | package scripts and hopes the changes work.
        
               | MadnessASAP wrote:
               | > 3. The Nix language (and library?) is not sane. Nobody
               | actually understands it, only copy-pastes pieces of
               | existing package scripts and hopes the changes work.
               | 
               | Perhaps Nix is "Wonko the Sane" and it is in fact the
               | rest of us who are in the asylum?
               | 
               | Nix, the language, is a little strange at first but
               | really does make sense. Nixpkgs, the "standard library",
               | is a little stranger and sometimes makes an odd default
               | choice. The nice thing though is that using Nix you can
               | coerce Nixpkgs into just about any shape that suits you.
        
           | anitil wrote:
           | > Systemd has been a game-changer for small-scale
           | deployments.
           | 
           | Why is this? My only memory of systemd was slightly better
           | configurations for sequencing the start of processes that
           | depended on the completion of earlier processes so I'm a bit
           | rusty.
        
             | SteveNuts wrote:
             | Systemd has timers now which have way better error
             | handling.
        
           | pjmlp wrote:
           | Which is kind of ironic, given that systemd basically brings
           | into Linux system services management from other UNIXes,
           | Windows, mainframes and micros, but still gets plenty of
           | hate.
        
         | JdeBP wrote:
         | It's folk wisdom, generated by a long line of people who did
         | not have proper daemon management despite such tooling having
         | been available since the 1990s. Any sort of service management,
         | from running things once at bootstrap to having a long-running
         | service, becomes hammered into the shape of a cron job.
         | 
         | There are _loads_ of people over the years who have reached for
         | cron instead of reaching for proper general-purpose daemon
         | management (SRC, SMF, daemontools, runit, daemontools-encore,
         | perp, s6, ...). It is on Stack Exchange answers and in people's
         | personal "How I did this" articles on WWW sites. (Although
         | the idea goes back to the Usenet era.) It became one of those
         | practices perpetuated because other people did it.
         | 
         | The next step is always discovering that cron's error handling
         | and logging are aimed at an era when the system operator sat in
         | the console room, and received "You have new mail"
         | notifications at the console shell prompt.
         | 
         | And the step after that is (re-)discovering that the anacron
         | approach does not fully cut the mustard. (-:
        
         | slightwinder wrote:
         | Single scripts are easier to code and can be written more loosely,
         | as you don't have to look out for sneaky memory-leaks and other
         | problems which might emerge in long-running tasks. There is
         | also no need to build and maintain a bespoke framework for
         | managing your multiple jobs. This avoids mental debt for the
         | devs. If you have many jobs, from multiple devs, it's the more
         | pragmatic solution.
        
       | shawn_w wrote:
       | Isn't a "centralized task scheduler" pretty much what cron is?
        
         | somat wrote:
         | I was going to guess the author needed something that unified
         | the task scheduling across a distributed system of computers.
         | But that requirement is never mentioned in the article. And
         | they still use cron to call their new scheduler... So unless I
         | am missing something they did not replace cron at all, they
         | just rewrote their scheduled jobs to use a common library and
         | have more robust error handling.
        
         | UltraSane wrote:
         | centralized for many computers.
        
         | eschaton wrote:
         | It's not even a centralized task scheduler on its native UNIX:
         | it's a centralized *userspace* task scheduler.
         | 
         | Mainframe and minicomputer operating systems support scheduling
         | in the operating system itself, as part of their process/thread
         | scheduler; their native queuing systems are built on top of the
         | primitives their scheduler offers, for proper accounting and
         | maximum resource utilization (including prioritization).
         | 
         | Only UNIX would just provide a way to run processes at a
         | specified time or interval and call the job done.
        
           | JdeBP wrote:
           | Although you're right that Unix never really reached having
           | the full three-level scheduling mechanisms of the mainframe
           | operating systems, cron is not the actual Unix parallel of
           | the high-level scheduler that keeps the running jobs list
           | fed.
           | 
           | That is in fact batch (and atrun, although that's considered
           | an implementation detail).
           | 
           | * https://pubs.opengroup.org/onlinepubs/9799919799/utilities/
           | b...
           | 
           | Most implementations flesh out the "implementation-defined
           | algorithms" stuff to be calculations based upon load
           | averages, as on NetBSD.
           | 
           | * https://man.netbsd.org/batch.1
           | 
           | * https://man.netbsd.org/atrun.8
           | 
           | Or fairly primitive parallelism limits as on Illumos.
           | 
           | * https://illumos.org/man/1/batch
           | 
           | * https://illumos.org/man/5/queuedefs
           | 
           | Not quite JECL, is it? (-:
        
         | ptx wrote:
         | It's lacking a convenient way to queue a task and inspect the
         | task queue, but "at" (at/atq/atrm) provides exactly the "single
         | cron job responsible for executing scheduled tasks that runs
         | once every minute" that the author was looking for.
        
       | burnt-resistor wrote:
       | Jobs that need retries, atomicity, monitoring, rescheduling, ad
       | hoc scheduling, and flexibility probably aren't suited to most
       | cron servers.
       | 
       | Beanstalkd, cronicle, agenda, sidekiq, faktory, celery, etc. are
       | the usual suspects.
       | 
       | What is often missing is HA of the controller service process.
        
         | sfortis wrote:
         | Cronicle is a lifesaver. HA, clustering, API, clean UI, it's
         | doing everything right. I'm using this also as an API wrapper
         | for Bash and Python scripts.
         | 
         | https://github.com/jhuckaby/Cronicle/blob/master/docs/Setup....
        
         | mrweasel wrote:
         | I'd probably even add systemd timers to that list. It does most
         | of what you list, minus the retries (but I think you could
         | handle that in the service definition)
        
       | sontek wrote:
       | I love this solution, I've implemented a very similar task
       | scheduler at many companies.
       | 
       | I do think the _best_ solution for this is still RabbitMQ. It has
       | the ability to push tasks in the queue and tell it to run at a
       | very specific time called  "Delayed Messages" and then it just
       | processes them at that time.
        
       | pokstad wrote:
       | Temporal.io is made for this
        
         | halamadrid wrote:
         | Unmeshed.io is another alternative. You don't even need to
         | write code for your schedules
        
         | theshrike79 wrote:
         | Paying $500 a month for cron just seems wrong.
         | 
         | And adds an external dependency for something very essential.
        
           | pokstad wrote:
           | You can run it yourself for free
        
       | pjmlp wrote:
       | If they are using AWS, why not use what AWS already has, battle
       | tested for task scheduling functions?
        
         | nevon wrote:
         | I've built something similar as a service to be used by
         | developers at a large-ish enterprise. Granted, it was based on
         | functionality offered by AWS, but the users didn't really know
         | that.
         | 
         | The reason we built it, despite the fact that developers could
         | very well have deployed a CloudWatch EventBridge schedule + SQS
         | + lambda or similar, is because they never did. They would
         | consistently choose to build it into their existing services,
         | which were rarely if ever handling things like limiting
         | concurrency if a task took too long, emitting metrics on
         | success/failure/duration, audit logging for when a task had to
         | be manually triggered for some reason. If I had to guess, I
         | think the reason was because it allowed them to piggyback on
         | existing change controls and "just write application code"
         | instead of having to think about additional pieces of
         | infrastructure.
         | 
         | If I could do it again, I would probably have reached for
         | something like Temporal, even though it seemed overkill for
         | what we initially set out to do. It took about a week before
         | people started asking for locking and retries.
        
         | guappa wrote:
         | So that they can drop AWS
        
           | pjmlp wrote:
           | It is a bit hard when they rely on AWS message queues for the
           | implementation.
        
           | mrweasel wrote:
           | If you're running on AWS and not designing a system that
           | locks you in to the AWS platform, then you're going to be
           | overpaying by a lot.
        
       | UltraSane wrote:
       | The Windows Task Scheduler is actually very nice and powerful.
       | One cool trick is to have a task triggered by a windows event.
        
       | jiggunjer wrote:
       | Aka workflow orchestrator, pipeline manager, process runner,
       | automation tool.
       | 
       | It's not clear if they used a product or DIY solution. The nice
       | thing many existing products offer is a web UI and a database.
        
       | Felk wrote:
       | I see that the author took a 'heuristical' approach for retrying
       | tasks (having a predetermined amount of time a task is expected
       | to take, and consider it failed if it wasn't updated in time) and
       | uses SQS. If the solution is homemade anyway, I can only
       | recommend leveraging your database's transactionality for this,
       | which is a common pattern I have often seen recommended and also
       | successfully used myself:
       | 
       | - At processing start, update the schedule entry to 'executing',
       | then open a new transaction and lock it, while skipping
       | already locked tasks (`SELECT FOR UPDATE ... SKIP LOCKED`).
       | 
       | - At the end of processing, set it to 'COMPLETED' and commit.
       | This also releases the lock.
       | 
       | This has the following nice characteristics:
       | 
       | - You can have parallel processors polling tasks directly from
       | the database without another queueing mechanism like SQS, and
       | have no risk of them picking the same task.
       | 
       | - If you find an unlocked task in 'executing', you know the
       | processor died for sure. No heuristic needed
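
The two steps above can be sketched roughly as follows. This is a hedged illustration, not Felk's or the article author's code: it assumes a psycopg2-style DB-API connection with autocommit off and `%s` placeholders, a PostgreSQL version that supports `SKIP LOCKED` (9.5 or later), and a hypothetical `scheduled_tasks` table with `id`, `payload`, and `status` columns. The function name `run_one_task` is also made up.

```python
PICK_SQL = """
    SELECT id FROM scheduled_tasks
    WHERE status IN ('pending', 'executing')
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
"""
LOCK_SQL = """
    SELECT payload FROM scheduled_tasks
    WHERE id = %s AND status = 'executing'
    FOR UPDATE SKIP LOCKED
"""
SET_STATUS_SQL = "UPDATE scheduled_tasks SET status = %s WHERE id = %s"


def run_one_task(conn, handler):
    """Claim one task, run handler(payload) while holding its row lock.

    Returns the task id, or None if nothing could be claimed.
    """
    cur = conn.cursor()

    # Transaction 1: pick an unlocked task and mark it 'executing'.
    # An unlocked row already in 'executing' means its worker died,
    # so it is eligible to be picked up again.
    cur.execute(PICK_SQL)
    row = cur.fetchone()
    if row is None:
        conn.commit()
        return None
    task_id = row[0]
    cur.execute(SET_STATUS_SQL, ("executing", task_id))
    conn.commit()

    # Transaction 2: take the row lock and keep it for the whole run.
    cur.execute(LOCK_SQL, (task_id,))
    row = cur.fetchone()
    if row is None:
        conn.commit()  # another worker locked it first
        return None
    try:
        handler(row[0])
    except Exception:
        conn.rollback()  # lock released; row stays 'executing' for a retry
        raise
    cur.execute(SET_STATUS_SQL, ("completed", task_id))
    conn.commit()  # committing also releases the row lock
    return task_id
```

Note that the second transaction stays open for as long as `handler` runs, which is the trade-off raised in the replies.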
        
         | alex5207 wrote:
         | This is exactly what we're doing. Works like a charm.
        
         | diarrhea wrote:
         | This introduces long-running transactions, which at least in
         | Postgres should be avoided.
        
           | danielheath wrote:
           | Depends what else you're running on it; it's a little
           | expensive, but not prohibitively so.
        
             | maxbond wrote:
             | Long running transactions interfere with vacuuming and
             | increase contention for locks. Everything depends on your
             | workload but a long-running transaction holding an
             | important lock is an easy way to bring down production.
        
               | iglio wrote:
               | If the system is already using SQS, DynamoDB has this
               | locking library which is lighter weight for this use case
               | 
               | https://github.com/awslabs/amazon-dynamodb-lock-client
               | 
               | > The AmazonDynamoDBLockClient is a general purpose
               | distributed locking library built on top of DynamoDB. It
               | supports both coarse-grained and fine-grained locking.
        
           | aitchnyu wrote:
           | I read too many "use Postgres as your queue (pgkitchensink is
           | in beta)", now I'm learning listen/notify is a strain, and so
           | are long transactions. Is there a happy medium?
        
             | bmn__ wrote:
             | Just stop worrying and use it. If and when you actually
             | bump into the limitations, then it's time to sit down and
             | think and find a supplement or replacement for the
             | offending part.
        
               | rco8786 wrote:
               | Excellent advice across many domains/techs here.
        
           | thewisenerd wrote:
           | t1: select for update where status=pending, set
           | status=processing
           | 
           | t2: update, set status=completed|error
           | 
           | these are two independent, very short transactions? or am i
           | misunderstanding something here?
           | 
           | --
           | 
           | edit:
           | 
           | i think i'm not seeing what the 'transaction at start of
           | processor' logic is; i'm thinking more of a polling logic
           | 
           |     while true:
           |       r := select for update
           |       if r is None:
           |         return
           |       sleep a bit
           | 
           | this obviously has the drawback of knowing how long to sleep
           | for; and tasks not getting "instantly" picked up, but eh,
           | tradeoffs.
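
A runnable sketch of the two-short-transactions variant described above, using Python's built-in `sqlite3` as a stand-in so it needs no server. Assumptions to note: SQLite has no `SELECT ... FOR UPDATE`, so `BEGIN IMMEDIATE` (which takes the write lock up front) plays that role here; the `tasks` table and the names `make_demo_db`, `claim_next`, `finish`, and `drain` are hypothetical.

```python
import sqlite3


def make_demo_db(n_tasks):
    # isolation_level=None puts the sqlite3 module in autocommit mode,
    # so the explicit BEGIN/COMMIT below define the transaction boundaries.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO tasks (status) VALUES (?)", [("pending",)] * n_tasks
    )
    return conn


def claim_next(conn):
    """t1: move one pending task to 'processing'; return its id or None."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE tasks SET status = 'processing' WHERE id = ?", (row[0],)
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return None if row is None else row[0]


def finish(conn, task_id, ok):
    """t2: record the outcome in its own short (autocommit) transaction."""
    status = "completed" if ok else "error"
    conn.execute("UPDATE tasks SET status = ? WHERE id = ?", (status, task_id))


def drain(conn, handler):
    """Process tasks until none are pending; return how many were handled."""
    handled = 0
    while True:
        task_id = claim_next(conn)
        if task_id is None:
            return handled
        try:
            handler(task_id)
            finish(conn, task_id, ok=True)
        except Exception:
            finish(conn, task_id, ok=False)
        handled += 1
```

Neither transaction stays open while the handler runs, but a worker that dies mid-task leaves a row stuck in 'processing', so some timeout heuristic is needed again.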
        
             | maxbond wrote:
             | They're proposing doing it in one transaction as a
             | heartbeat.
             | 
             | > - If you find an unlocked task in 'executing', you know
             | the processor died for sure. No heuristic needed
        
       | dthedavid wrote:
       | Great work. Did you consider buying instead of building? I've
       | worked at organizations that built similar systems, but what was
       | often lacking was developer experience, observability, and
       | scalability, basically everything outside of core functionality;
       | essentially the stuff that you're trying to tack on as you
       | improve your system.
       | 
       | Now that I'm building on my own, I've thought about building as
       | well, but I've found that off-the-shelf systems handle all of
       | this far better (and they are opensourced too), ie trigger-dot-
       | dev and many others.
        
       | dmitry-vsl wrote:
       | > We had createScheduledPosts.ts that would run every 15 minutes,
       | scan our table of scheduled posts and create any that needed to
       | be published.
       | 
       | Why not set the publication_date when you create a post and have
       | a function getPublishedPosts that fetches a list of posts,
       | filtering out those with a publication_date later than the
       | current date? With this approach, you don't need cron jobs at
       | all.
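
The query-time approach suggested above might look like this in Python. It is only a sketch: `publication_date` and the function name come from the comment, while the `Post` type and the in-memory list are assumptions standing in for a real table and query.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Post:
    title: str
    publication_date: datetime


def get_published_posts(posts, now=None):
    """Return only the posts whose publication_date has already passed."""
    if now is None:
        now = datetime.now(timezone.utc)
    return [p for p in posts if p.publication_date <= now]
```

A post scheduled for the future simply stays invisible until the clock passes its `publication_date`, so nothing has to run on a schedule just to flip a flag.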
        
         | jon-wood wrote:
         | Maybe there's a bunch of other actions that need to take place
         | when a post is published, such as sending notification emails,
         | or posting stuff to social media. They could of course be
         | scheduled jobs in their own right, but you haven't really saved
         | yourself any effort there, and now if the publishing time
         | changes you've got to reschedule all those individual jobs.
        
       | shireboy wrote:
       | One gotcha with roll your own task scheduler is if you want to
       | run it across multiple machines. If you need 5 machines running
       | different scheduled tasks, you need a locking mechanism to ensure
       | only one machine is processing the task. In the author's approach
       | this is handled by the queue, but in my read the scheduler can
       | only happen on one machine or you get multiple of the same task
       | in the queue. Retry can get more complicated- depending on the
       | failure you may want an exponential backoff, retrying N times and
       | waiting longer durations between. A nice dashboard to see the
       | status of everything is helpful also.
       | 
       | In .NET world I use Hangfire for this. In Node (I assume what
       | this is) I tinkered with Bull, but not sure what best in class is
       | there.
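
The exponential backoff mentioned above (retry N times, waiting longer each time) can be sketched in a few lines of Python. The function names, the base delay, and the cap are illustrative assumptions; libraries such as Hangfire or Bull have their own retry settings, which this does not represent.

```python
import time


def backoff_delays(max_retries, base=1.0, cap=60.0):
    """Delays in seconds before each retry: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def run_with_retries(task, max_retries=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call task(); on failure wait and retry, re-raising after max_retries retries."""
    delays = backoff_delays(max_retries, base, cap)
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
```

Passing `sleep` as a parameter keeps the function easy to test; in production a random jitter is often added to each delay so that many failing tasks do not retry in lockstep.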
        
         | cyberpunk wrote:
         | Oban enters the chat... :)
        
       | sunshine-o wrote:
       | Is there a cool lightweight alternative to cron for (at least) a
       | single host?
       | 
       | To illustrate what I am looking for, I often end up using
       | supervisord [0] (but I also like immortal [1]) for process
       | control when not on a systemd enabled system. In my experience
       | they are reliable, lightweight and a pleasure to work with.
       | 
       | I am looking for something similar for scheduled jobs.
       | 
       | - [0] https://supervisord.org/
       | 
       | - [1] https://immortal.run/
        
         | isp wrote:
         | Supercronic: https://github.com/aptible/supercronic
         | 
         | Designed to run in a container, but should equally well work on
         | a single host. However, no option for "high availability"
         | running, where multiple hosts coordinate.
        
         | justusthane wrote:
         | Take a look at this comment for some options:
         | https://news.ycombinator.com/item?id=44752548
        
       | majkinetor wrote:
       | I find Rundeck is great for this. Using it with hundreds of jobs
       | for a decade, with a bunch of users accessing it and checking
       | logs, having retries, notifications and all enterprise thingies
       | for free. Providing easy way to have GUI for scripts.
        
       | pinko wrote:
       | HTCondor is always an option. Lacks shiny tinfoil, but works like
       | a tank.
        
       | rashidae wrote:
       | What happens when the DB gets large? How do you handle
       | idempotency? (What if SQS delivers twice?) The cron job is still
       | a single point of failure...
        
         | qianli_cs wrote:
         | Managing complex scheduled workflows at scale comes with a lot
         | of nuances. This is exactly why we're building DBOS (shameless
         | plug! https://github.com/dbos-inc), which provides durable cron
         | jobs and exactly-once workflow triggering. Since it's just a
         | library on top of Postgres, it doesn't require a centralized
         | scheduler (well, think of Postgres as the coordinator).
         | 
         | One challenge is to guarantee exactly-once processing across
         | software upgrades. DBOS uses the cron-scheduled time as an
         | idempotency key, and tags each workflow execution with a
         | version. We also use the database transactions to guard against
         | conflicting concurrent updates.
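
The idea of using the cron-scheduled time as an idempotency key can be illustrated with a unique key in a database table. To be clear, this is not DBOS's API or implementation; it is a generic sketch using Python's built-in `sqlite3`, and the `runs` table and the names `open_runs_db` and `trigger_once` are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def open_runs_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE runs (idempotency_key TEXT PRIMARY KEY, workflow TEXT NOT NULL)"
    )
    return conn


def trigger_once(conn, workflow, scheduled_at):
    """Record a run for this workflow and scheduled slot.

    Returns True only for the first caller; duplicates (for example a
    message delivered twice) hit the primary key and are ignored.
    """
    key = f"{workflow}:{scheduled_at.isoformat()}"
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO runs (idempotency_key, workflow) VALUES (?, ?)",
            (key, workflow),
        )
    return cur.rowcount == 1
```

The key is built from the scheduled slot rather than the wall-clock time at which the trigger happens to fire, so a late or repeated delivery for the same slot maps to the same key.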
        
       ___________________________________________________________________
       (page generated 2025-08-01 23:00 UTC)