[HN Gopher] Nextflow: Data-Driven Computational Pipelines
___________________________________________________________________
Nextflow: Data-Driven Computational Pipelines
Author : brianzelip
Score : 58 points
Date : 2023-08-08 14:03 UTC (2 days ago)
(HTM) web link (nextflow.io)
(TXT) w3m dump (nextflow.io)
| anyoneamous wrote:
| It's kind of a shame this is based on Groovy, rather than Python
| which is much more familiar to people in the HCLS space. I've
| always been stuck on the fence between wanting to use NF (since
| it's the most popular) and Snakemake (which feels like less of an
| oddity development-wise).
| bafe wrote:
| Perhaps it is not as popular, but I found the groovy syntax
| ideal for DSL like that
| dekhn wrote:
| In theory I want to like Nextflow, but my main criticism is that
| it's really, really hard to debug programs that pass around lists
| of Promises (nextflow's dag uses promises as handles on future
| computations, and functions receive promises and can't easily
| materialize them and print them).
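|
| A rough Python analogy of the debugging problem described above,
| using concurrent.futures rather than anything from Nextflow itself.
| The step names (align, count_reads) and sample names are made up
| for illustration. It only sketches why a step that receives a
| handle cannot print its input without first waiting on it.

```python
from concurrent.futures import ThreadPoolExecutor


def align(sample):
    # Stand-in for a pipeline step; the name is hypothetical.
    return f"{sample}.bam"


def count_reads(bam_future):
    # This step receives a handle, not a value: it has to block on
    # .result() before it can inspect or print what it was given.
    bam = bam_future.result()
    return (bam, len(bam))


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(align, s) for s in ["s1", "s2"]]
    # The repr of each handle describes a Future object, not a file name.
    handles_repr = [repr(f) for f in futures]
    counts = [pool.submit(count_reads, f) for f in futures]
    results = [c.result() for c in counts]
```

| Logging handles_repr shows Future objects; only results holds the
| actual values, and only after every upstream step has finished.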
|
| The caching is often more trouble than it's worth. Also, the
| little bash scripts that integrate with AWS break if your AWS
| environment isn't vanilla (our enterprise AWS has a lot of
| restrictions).
| nonrepeating wrote:
| Using Tower for Nextflow can help streamline it on AWS, it's
| pretty powerful (but costs money for anything beyond trivial
| use cases): tower.nf
| adolph wrote:
| Another day, another workflow DSL.
|
| * looks like yaml
| * has curly braces to look programmery
| * whitespace might be meaningful
| * has pipes like a bash script
|
| https://www.commonwl.org/
|
| https://github.com/common-workflow-language/common-workflow-...
|
| mea culpa: The above was based on a first look at something
| titled "A DSL for parallel and scalable computational
| pipelines" as opposed to "Java workflow manager with Groovy
| language scripting." The presented screenshot still looks to me
| like an unholy union of yaml, js, py and sh. If that sounds
| groovy to you; have fun.
| esafak wrote:
| In all fairness, they predate the competition (2013):
| https://github.com/nextflow-io/nextflow/releases?page=25
| geoffjentry wrote:
| As GP referenced CWL: while NF had appeared first, in terms of
| the bioinformatics world Nextflow, CWL, Snakemake, and WDL
| all erupted close enough to each other to be equal-ish. The
| people were aware of each other but they were all so nascent
| that it wasn't clear if it was worth joining in or not. At
| the end of the day these all came from groups trying to
| scratch particular itches, and not everyone agreed on the
| right way to scratch.
|
| However all of them were rejections of prior models as well
| as the workflow solutions prominent in the business space.
| adolph wrote:
| Yeah, the thing that I find disappointing is that there is
| a lot of science value locked into the different systems of
| describing a workflow, pipeline or DAG. Like you said, they
| all had different itches to scratch and even some barebones
| "standards" like csv have flavors/extensions/etc.
| bafe wrote:
| They try to address similar problems, but comparing
| snakemake and nextflow doesn't do either tool a favour.
| They use different computation models: nextflow is based on
| dataflow programming and therefore schedules processes
| dynamically as new data comes in, while snakemake is pull-
| based and schedules the processes based on the dag defined
| by the dependencies. Anyhow they are both great tools.
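|
| A toy Python sketch of the two scheduling models described above.
| It is not how either tool is implemented; the UPSTREAM table, the
| step names and both function names are assumptions made up for
| illustration.

```python
from collections import deque

# Hypothetical three-step pipeline: each step names the step it consumes from.
UPSTREAM = {"trim": None, "align": "trim", "call": "align"}


def pull_schedule(target):
    """Pull-based: start from the requested target, walk the
    dependencies backwards, then run them in dependency order."""
    order = []
    step = target
    while step is not None:
        order.append(step)
        step = UPSTREAM[step]
    return list(reversed(order))


def push_schedule(items):
    """Push-based (dataflow): each arriving item triggers the next
    step, so work is scheduled as data shows up, not planned up front."""
    downstream = {up: step for step, up in UPSTREAM.items() if up}
    first = next(s for s, up in UPSTREAM.items() if up is None)
    queue = deque((first, item) for item in items)
    log = []
    while queue:
        step, item = queue.popleft()
        log.append((step, item))
        if step in downstream:
            # Finishing a step emits the item to whatever consumes it.
            queue.append((downstream[step], item))
    return log
```

| pull_schedule("call") plans the whole chain before anything runs,
| while push_schedule(["a", "b"]) only learns what to run next as
| each item comes out of the previous step.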
| mbreese wrote:
| In fairness, this is an old problem with many other
| contenders. This issue is as old as batch schedulers. FWIW, I
| was at an ISMB conference in 2005 that had at least 2-3
| workflow managers presented.
| bafe wrote:
| If you refer to nextflow, the syntax is basically groovy
| adolph wrote:
| Interesting. From the docs:
|
| _The Nextflow scripting language is an extension of the
| Groovy programming language. Groovy is a powerful programming
| language for the Java virtual machine. The Nextflow syntax
| has been specialized to ease the writing of computational
| pipelines in a declarative manner._
|
| https://www.nextflow.io/docs/latest/script.html?highlight=gr...
| notQuiteEither wrote:
| Indeed, it's a fully functional scripting language that is an
| extension of groovy. So "looks programmery" is more than a
| little reductive.
|
| Edit: functional as in, it's not half-hearted, not functional
| as in functional programming.
| firecraker wrote:
| So my question to the non-bioinformaticians - is this already a
| solved problem?
|
| You have tasks which require resources based on the input
| parameters, these are run in docker containers to ensure the
| environment and you want to track the output of each step. Often
| these are embarrassingly parallel operations (e.g. I have 200
| samples to do the same thing on).
|
| Something like dask perhaps, but can specify a docker image for
| the task?
|
| What is the go-to in DevOps for similar tasks? GitHub actions
| comes pretty close...
|
| To bioinformaticians: what is the unique selling point of Nextflow
| over, say, WDL/Cromwell?
| radus wrote:
| I've considered using Nextflow for bioinformatics pipelines but
| have yet to take the plunge.
|
| At work, I develop a proteomics pipeline that is composed of
| huey [1] tasks (Python library; simple alternative to Celery) which
| either use subprocess to call out to some external tool, or are
| just pure python. It runs in a worker container which is managed
| by Docker swarm, and all containers pull jobs from redis. For our
| scale, it works great. However, I don't have control over the
| resource utilization of individual steps, and in the past I've
| had issues with the pipeline blocking as a result of how I was
| chaining tasks together. I think something like Nextflow would
| remove these limitations, but one thing I think I would miss is
| the ability to debug individual pipeline steps locally with an
| interactive debugger. As far as I can tell, Nextflow has
| logging/tracing facilities but nothing quite like an interactive
| debugger. I'd be happy to be told I'm wrong, or even that I'm
| doing it wrong.
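|
| A minimal sketch of the kind of step described above, assuming each
| step is kept as a plain Python function that a task queue would wrap
| separately. The function names and the FASTA-like text are
| hypothetical, and the Python interpreter stands in for a real
| external tool.

```python
import subprocess
import sys


def run_tool(args):
    """Pipeline step that shells out to an external tool.

    Kept as a plain function so it can be called directly, outside
    any task queue, and stepped through with a debugger.
    """
    completed = subprocess.run(
        args, capture_output=True, text=True, check=True
    )
    return completed.stdout.strip()


def count_entries(fasta_text):
    """Pure-Python step: counts FASTA header lines."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


# The child process prints a tiny two-entry FASTA-like text.
tool_output = run_tool(
    [sys.executable, "-c", "print('>p1\\nAAA\\n>p2\\nCCC')"]
)
n_entries = count_entries(tool_output)
```

| Because both steps are ordinary functions, a breakpoint() inside
| either one works in a local session without the worker container.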
|
| Other reasons I'd like to start using Nextflow:
|
| - my homebrew pipeline would be easier to setup/share
|
| - there are some efforts in the proteomics community to develop
| Nextflow pipelines (e.g. QuantMS [2]). I think it would help to have a
| shared language to express pipelines, and it would make
| benchmarking simpler.
|
| ___
|
| [1] https://github.com/coleifer/huey/
|
| [2] https://docs.quantms.org/en/latest/
| nonrepeating wrote:
| The closest I've gotten to local debugging is having the Python
| scripts that are launched by Nextflow steps connect to a remote
| debugger process ("remote" but running on the same
| workstation). PyCharm makes this fairly painless to
| orchestrate. I've never been able to debug the Groovy script in
| a Nextflow pipeline itself; I think you'd need a debug build of
| the nextflow executable for that.
| suslik wrote:
| I develop bioinformatics pipelines for a living and am very
| opinionated on the topic.
|
| Having enough experience with snakemake as well as nextflow in
| production for many years now, I would always opt for
| snakemake for anything but extremely large DAGs (which is quite
| rare for bioinformatics pipelines). The fact that nextflow
| still doesn't allow deleting temporary files during execution or
| re-running the workflow from a set of intermediary files is an
| insane deal breaker (not mentioning other strange things like an
| arbitrary limit of 1000 parallel jobs etc.). AWS runners that
| once were nextflow's selling point are not an advantage anymore
| given Amazon Genomics CLI.
|
| Subjectively, writing pipelines that incorporate conditional
| logic is much nicer in python+snakemake than groovy+nextflow, but
| maybe there is someone out there who prefers groovy.
|
| Is a proper dry run possible in nextflow already btw?
| mbreese wrote:
| I do too.. and have similar opinions. I wrote my own tool years
| back for pipelines because it was always frustrating (started
| roughly around the same time as Nextflow).
|
| Allowing for files to be marked as transient (temp) and
| re-running from arbitrary time points are definitely among the
| things I support... as is conditional logic within the pipeline
| for job definition and resource usage. For me though, one of
| the biggest things is that I like having composable pipelines,
| so each part of the larger workflow can be developed
| independently. They can interact with each other (DAG) and use
| existing dependencies, but they don't have to exist in the same
| document/script. I work on large WGS datasets, so 1000's of
| jobs per patient isn't uncommon.
|
| Happy to talk more if you're interested.
|
| https://github.com/compgen-io/cgpipe
|
| And yes, you can dry run the entire thing. It will write out a
| bash script if you want to see exactly what is going to run
| without submitting jobs. It's a full language for pipelines,
| but heavily inspired by Makefiles (with conditional logic).
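|
| To make the dry-run idea concrete, here is a small Python sketch of
| Makefile-style rules rendered as a bash script instead of being
| submitted. It is not cgpipe's syntax or implementation; the RULES
| table, the file names and the trim/align/call commands are all
| hypothetical.

```python
# Hypothetical Makefile-like rules: target -> (dependencies, shell command).
RULES = {
    "reads.trimmed.fq": (["reads.fq"], "trim reads.fq > reads.trimmed.fq"),
    "aln.bam": (["reads.trimmed.fq"], "align reads.trimmed.fq > aln.bam"),
    "calls.vcf": (["aln.bam"], "call aln.bam > calls.vcf"),
}


def plan(target, existing, seen=None):
    """Return the commands needed to build target, dependencies first.

    Files listed in `existing` are treated as already built.
    """
    seen = set() if seen is None else seen
    if target in existing or target in seen:
        return []
    if target not in RULES:
        raise ValueError(f"no rule to make {target}")
    seen.add(target)
    deps, command = RULES[target]
    commands = []
    for dep in deps:
        commands += plan(dep, existing, seen)
    return commands + [command]


def dry_run(target, existing):
    """Render the plan as a bash script instead of submitting jobs."""
    return "\n".join(["#!/bin/bash", "set -e"] + plan(target, existing))


script = dry_run("calls.vcf", existing={"reads.fq"})
```

| Passing an intermediate file in existing, for example {"aln.bam"},
| shortens the plan to the remaining steps, which is roughly what
| re-running from an arbitrary point amounts to in this toy model.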
| mribeirodantas wrote:
| It's been a while since you can rerun/resume Nextflow
| pipelines, and yes, you can have dry runs in Nextflow. I have
| no idea what you're referring to with the 'arbitrary limit of
| 1000 parallel jobs' though. As for deleting temporary files,
| there are features that allow you to do a few things related to
| that, and other features being implemented.
___________________________________________________________________
(page generated 2023-08-10 23:00 UTC)