[HN Gopher] Seq - A programming language for computational genom...
___________________________________________________________________
Seq - A programming language for computational genomics and
bioinformatics
Author : tdido
Score : 103 points
Date : 2021-09-15 09:54 UTC (13 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| Bostonian wrote:
| The code examples look like Python 2 rather than Python 3. Print
| does not have parentheses. Why was this decision made?
| haihaibye wrote:
| They support both print syntaxes, and will deprecate Python 2
| style soon.
|
| https://github.com/seq-lang/seq/issues/223
| [deleted]
| car wrote:
| Looks great, will definitely give this a try since it does
| sequence manipulations that I otherwise have to write myself.
|
| Will this be available via conda? And how would seq integrate
| with Snakemake, since that is also based on Python?
| tdido wrote:
| Seems like there's a conda package in the works:
| https://github.com/bioconda/bioconda-recipes/pull/29660
| dunefox wrote:
| BioJulia might be the better choice since it validates data and
| seq doesn't: https://biojulia.net/post/seq-lang/
| chmaynard wrote:
| I'm wondering if Seq can also serve as a general-purpose
| replacement for Python whenever a fast executable is needed.
| arshajii wrote:
| (I'm one of the developers on Seq.) We've actually been working
| mostly on closing the gap with Python for the last year or so.
| Seq can be useful for plain Python programs as well -- I give a
| bit more context in my comment above.
| dunefox wrote:
| It's a domain specific language for bioinformatics. So, most
| likely not.
| gandalfgeek wrote:
| Quick explainer video: https://youtu.be/5bk4Wc5Op2M
| clusterhacks wrote:
| I am a CS person who works with bioinformaticians every day as
| part of my job.
|
| I really like that Seq seems to have built-in some
| parallelization ability. I spend no small amount of time in my
| day job doing that manually in R with RcppParallel for loops that
| are totally independent across each iteration.
|
| Bioinformaticians are often educated to use a specific
| programming language and environment. They aren't usually looking
| to try other languages. For example, I support our bioinformatics
| group and they are basically 100% R and RStudio users. We have a
| single user of Python and that user is doing "typical" tensorflow
| stuff with images.
|
| I've noticed this same bias towards a single language for some
| other academic niches. Like SAS or Stata camps in public health
| or psychology - I think of these languages as basically the same,
| but for non-CS folks the perception seems to be more like English
| vs Russian.
|
| Even more complicated, researchers may be extremely committed to
| a specific library in a language and suspicious of languages that
| don't have their favorite library available.
|
| Any shift to new tooling for these highly-committed users will
| almost certainly require _large_ and obvious benefits to gain
| traction.
| travisgriggs wrote:
| So basically, the same thing that kept(keeps?) Visual Basic in
| use for so long.
|
| My son works in polysci analytics and I see the same thing you
| describe. A group will pick a tool and flog all problems with
| it. Change rarely occurs. He was in the Stata camp at one
| university, the TidyVerse at MIT.
|
| It's very weird for me, I develop and maintain a piece of
| software that has 3 OSes, and 5 languages to wrestle with
| as well as multiple "tool" technologies like Ansible/MQTT, etc.
| so I'm very much in a polyglot-best-tool-for-the-job
| environment. Observationally from a casual POV, I see pros/cons
| both ways.
| psychometry wrote:
| Scientists like using R instead of Python because the language
| lets them get set up and coding quickly with RStudio. More
| importantly, the language, tooling, and ecosystem are very
| forgiving when it comes to code quality and style. There is
| good R code out there, but the R community generally lacks the
| wide acceptance of good coding practices you see with Python
| users: unit tests, sane dependency management, type hints,
| documentation, safe namespacing, etc.
|
| It's really saying something when scientists think writing
| Python code is a pain, because Python's a pretty forgiving
| language, too.
| dr_kiszonka wrote:
| Very interesting! I noticed a similar phenomenon in the GIS
| space. All of my colleagues with formal training in GIS use
| ArcGIS and its Python API, but those without such background
| gravitate towards FLOSS solutions.
|
| I am aware of only one case where a community migrated to other
| software. Many economists I know switched from Stata to R. Some
| of them later moved on to Python.
| encode wrote:
| Also see this comparison between Julia's BioSequences and Seq by
| Jakob Nissen and Ben Ward: https://biojulia.net/post/seq-lang/
| dunefox wrote:
| This shows imo that BioJulia is better, precisely because it
| validates data and is a broader programming language invented
| for science, not a DSL that optimises for speed over all else.
| Besides, the new version of BioJulia seems to perform even
| better than seq.
| dgb23 wrote:
| An interesting takeaway:
|
| > So it appears the primary reason BioJulia code is slower than
| Seq code in these three benchmarks is that BioSequences.jl is
| doing important work for you that Seq is not doing. As
| scientists, we hope you value tools that spend the time and
| effort to validate inputs given to it rather than fail
| silently.
|
| Reminds me of the myriads of Excel catastrophes.
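
The "fail silently" point in the quote above can be made concrete with a small sketch. It is plain Python, not Seq or BioSequences.jl code, and the names VALID_BASES, validate_dna, revcomp_unchecked and revcomp_checked, as well as the accepted alphabet ACGTN, are assumptions chosen only for illustration.

```python
# Assumed alphabet for this illustration; real libraries accept more symbols.
VALID_BASES = frozenset("ACGTN")

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def validate_dna(seq: str) -> str:
    """Return the upper-cased sequence, or raise on the first invalid base."""
    upper = seq.upper()
    for pos, base in enumerate(upper):
        if base not in VALID_BASES:
            raise ValueError(f"invalid base {base!r} at position {pos}")
    return upper


def revcomp_unchecked(seq: str) -> str:
    """Reverse complement without validation.

    str.translate leaves unknown characters untouched, so bad input
    passes through without any error.
    """
    return seq.upper().translate(_COMPLEMENT)[::-1]


def revcomp_checked(seq: str) -> str:
    """Reverse complement that refuses input outside the assumed alphabet."""
    return validate_dna(seq).translate(_COMPLEMENT)[::-1]
```

The checked variant does an extra pass over the input, which is the kind of additional work the quoted passage says a validating library performs.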
| tdido wrote:
| See also:
|
| https://dl.acm.org/doi/pdf/10.1145/3360551
|
| https://www.nature.com/articles/s41587-021-00985-6 (paywalled)
| fwip wrote:
| It's an impressive project, but I'm not sure the niche is big
| enough. It's certainly come a long way since the last time I
| looked at it!
|
| My biggest concern is that Seq sucks users into a sort of local
| maximum. While piping syntax is nice, and the built-in routines
| are handy, it's a lot less flexible than a "mainstream"
| programming language, simply because of the smaller community and
| relative paucity of libraries. BioPython[1] has been around a
| long long time, and I think a lot of potential users of Seq would
| be better suited by using a regular bioinformatics library in the
| language they know best.
|
| e.g: The example of reading Fasta files in Seq:
|
|     # iterate over everything
|     for r in FASTA('genome.fa'):
|         print r.name
|         print r.seq
|
| versus BioPython:
|
|     from Bio import SeqIO
|     for r in SeqIO.parse("genome.fa", "fasta"):
|         print(r.id)
|         print(r.seq)
|
| It might be pretty useful as a teaching tool, but I'm skeptical
| of its long-term benefit to professionals. I'm not sure the
| ecosystem of Seq users will be large enough, y'know? Again, it's
| pretty impressive work, and it's come a long way. I wish the devs
| all the best. :)
|
| 1. https://biopython.org/
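
For readers unfamiliar with the format, the two FASTA snippets in the comment above both iterate over (name, sequence) records. A minimal standard-library sketch of that iteration might look like the following. It is not how Seq or BioPython implement it, the function name read_fasta is hypothetical, and it assumes well-formed input where every record starts with a ">" header line.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # Keep the first whitespace-delimited token as the record name.
            name = fields[0] if fields else ""
            chunks = []
        else:
            if name is None:
                raise ValueError("sequence data before first header")
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if name is not None:
        yield name, "".join(chunks)


# Usage on an in-memory example instead of a genome file on disk.
records = list(read_fasta(io.StringIO(">chr1 demo\nACGT\nAC\n\n>chr2\nGG\n")))
```

Sequence lines that wrap across several lines are joined, so each record comes back as one string.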
| kasperset wrote:
| I like this idea. However to me it is similar to using a la carte
| tools/programs along with bash script or DSL such as Nextflow.
| More often these stand-alone programs are already written in
| compiled languages. I am sure Seq will allow building customized
| programs as compared to scripting or gluing programs.
| totalperspectiv wrote:
| It's odd that they didn't include Nim in the benchmarks in their
| paper: https://dl.acm.org/doi/pdf/10.1145/3360551
| jpxw wrote:
| I know nothing about Nim or genomics. Why is it odd that they
| didn't include Nim?
| pietroppeter wrote:
| Nim has had some success in genomics mainly thanks to the
| work of https://github.com/brentp
|
| Nim can be sold as "a strongly-typed and statically-compiled
| high-performance Pythonic language", like Seq (although
| it is more than that and does not actually have as a goal to
| be Pythonic, see https://nim-lang.org/ or
| https://github.com/Araq/nimconf2021/blob/main/zennim.rst).
|
| Still, given the small size of Nim community and even smaller
| size of the genomics nim subcommunity, I would say it is not
| that odd that it is not included in the benchmark. The existing
| nim genomics library might not even cover the functionalities
| required by the benchmark.
| lf-non wrote:
| Nim is not really 'pythonic'. It does have some superficial
| similarity with Python (being whitespace sensitive) but it
| begins to diverge pretty soon. This is not really a
| criticism of Nim. I quite like many of the choices in Nim.
|
| Seq claims that the vast majority of python programs would work
| as is. I have not validated that claim, but Nim can
| absolutely not make that claim. Any python library would
| require substantial porting effort to be translated to nim.
| Zababa wrote:
| Calling Nim Python is like calling OCaml or Scala Python,
| it's not really true. The main reason people use Python is
| because it is Python, not because of an extractable list of
| things.
| bscphil wrote:
| > Seq is a Python-compatible language, and the vast majority of
| Python programs should work without any modifications
|
| > Seq is able to outperform Python code by up to 160x.
|
| So ... a reimplementation of Python that can outperform cpython
| by over 100 times? I know _literally nothing_ about this project,
| but I have to say that rings pretty false for me. Hell, even PyPy
| has trouble with many applications. (Plus they're claiming to
| outperform "equivalent" C code by 2x.)
|
| Even if the performance claims _are_ overblown, it's always nice
| to see new work on compiled languages with easy-to-read syntax.
| It's hard to beat Python for an education / prototyping language,
| so I will definitely be giving this a look.
| drocer88 wrote:
| Look at the link: https://github.com/seq-lang/seq It says 96%
| of the code is C++ in the "Languages" box on the right. C ( and
| C++ and Rust) outperforms Python in benchmarks and certain
| optimized C code can do 160x over very naive Python. So this is
| very possible, though the routines tested are probably cherry
| picked for bragging rights.
| hoseja wrote:
| Probably, it can outperform generic python _specifically for
| genomics payloads_, versus python code/C code.
| snicker7 wrote:
| Most newer languages will give you multiple orders of magnitude
| better performance than python.
|
| Python's main advantage was that it was easier than some of its
| competitors (C++/Java). But that is no longer the case with
| modern languages (Nim/Crystal/Julia/JavaScript) being both
| faster and comparably as easy (or easier).
|
| It is now coasting off its momentum, mostly due to the vast
| amount of (usually poorly designed) open source libraries. That
| and Jupyter.
| amelius wrote:
| It's probably in the same sense that Numpy is much faster than
| doing matrix operations with pure Python arrays and Python for-
| loops.
| arc-in-space wrote:
| > We show that many important and widely-used NGS algorithms
| can be made up to 160x faster than their Python counterparts as
| well as 2x faster than the existing hand-optimized C++
| implementations
|
| It seems it's better to think of this particular claim as "we
| made a C++ algorithm that is 2x faster than the previous SotA
| C++ algorithm" (with the help of a heavily optimized DSL).
| aldanor wrote:
| I also know literally nothing about this particular project,
| but why not? If you support a small restricted subset of Python
| it's completely doable under _certain conditions_ for _specific
| types of programs_. E.g., Numba can easily outperform Python
| 100-1000x in numerical applications (done it myself multiple
| times), simply because it jit-compiles the code by first
| translating it to LLVM IR.
| mhenders wrote:
| (minor contributor) I've been following the project for a
| while and am pleasantly surprised by the ability to manually
| convert Python programs to Seq without needing to make too
| many changes. Note, most of my experimentation has been with
| smallish programs I've written. I like that I can still think
| "Pythonically" and compose mostly correct Seq code using
| familiar idioms, e.g. list/set/dict comprehensions. The
| standard library is very readable and a source for "from
| import" type functionality. Some of the other features I've
| come to appreciate: pipeline operator |>, JIT compile or
| create an executable (seqc run, seqc build), match
| statements, and strong typing.
| bscphil wrote:
| > If you support a small restricted subset of Python
|
| That's why I quoted their claim that the "vast majority" of
| Python programs run _unmodified_. Even PyPy barely achieves
| that. To really get 100x performance over Python (and even
| supposedly beat C) with a compiler that works on most
| unmodified Python code would be an _extraordinary_
| achievement.
| dunefox wrote:
| That seems to misrepresent the original points: it can run
| the vast majority of python programs unmodified AND in some
| cases outperform Python - not at the same time.
| dekhn wrote:
| Typically, any high performance (low latency or high throughput)
| genomics/bioinformatics application is not going to be written in
| plain Python, except possibly for prototyping. Instead, nearly
| all codes today are written in C++ or Java, with some sort of
| command and control in Python or a DAG-based workflow scheduler.
|
| I don't expect the community will adopt other languages at a
| large scale. My hope, though, is that more of these algorithms
| move to real distributed processing systems like Spark, to take
| advantage of all the great ideas in systems like that. But
| genomics will continue to trail the leading edge by about 20
| years for the foreseeable future.
| east2west wrote:
| I recall that the group that created Spark had a bioinformatics
| project on Spark but I don't know what happened to it. All I
| could find now is a paper[1] hosted by databricks.
|
| [1] https://databricks.com/wp-content/uploads/2018/08/SSE15-40-D...
| adgjlsfhk1 wrote:
| IMO, spark isn't the way forward. The typical pattern with it
| is it lets you scale up to 100 cores really easily which is
| almost enough to compete with a good single threaded
| implementation in a fast language.
| dekhn wrote:
| 100 cores? I forgot how to count that low.
|
| The workflows I deal with generally involve moving hundreds
| of terabytes of storage into memory, processing it, and
| writing it out. Single machines (even beefy ones) tend to hit
| their limits (networking, max RAM, cache size, TLB, etc).
|
| Maybe there's another tool better than Spark, I don't know,
| the important thing is that spark is the most ubiquitous.
| [deleted]
| fuzzythinker wrote:
| Used it for coding Coursera/Stepik's Bioinformatics course [1]
| when it was first announced 2 years ago.
|
| Not claiming it as any sort of reference, but you can see how it
| [2] may be used to solve some basic genome sequencing.
|
| [1] https://www.coursera.org/specializations/bioinformatics
|
| [2] https://github.com/fuzzthink/seq-genomics
| f6v wrote:
| > Think of Seq as a strongly-typed and statically-compiled
| Python: all the bells and whistles of Python, boosted with a
| strong type system, without any performance overhead.
|
| A pitch most people doing applied bioinformatics won't
| understand/appreciate.
| jack_riminton wrote:
| How do you pronounce Seq?
| adgjlsfhk1 wrote:
| Short for sequence.
| jack_riminton wrote:
| So is it pronounced sequence or like 'seek'?
| da39a3ee wrote:
| I have high confidence it's pronounced "seek".
| arshajii wrote:
| Hi everyone, I'm one of the developers on the Seq project -- I
| was delighted to see it posted here! We started this project with
| a focus on bioinformatics, but since then we've added a lot of
| language features/libraries that have closed the gap with Python
| by a decent margin, and Seq today can be useful in other areas or
| even for general Python programs (although there are still
| limitations of course). We're in the process of creating an
| extensible / plugin-able Python compiler based on Seq that allows
| for other domain-extensions. The upcoming release also has some
| neat features like OpenMP integration (e.g. "@par(num_threads=10)
| for i in range(N): ..." will run the loop with 10 threads). Happy
| to answer any questions!
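
As a rough point of comparison for the "@par(num_threads=10)" loop mentioned above, here is how a loop with independent iterations can be spread over 10 threads in standard Python. This is only an analogy: it uses concurrent.futures rather than OpenMP, it says nothing about how Seq implements @par, and the helper names gc_count and parallel_map are hypothetical. In CPython the global interpreter lock also means CPU-bound work like this will generally not speed up with threads, which is part of what a compiled language can improve on.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_count(seq: str) -> int:
    """Count G and C bases in one sequence (an independent unit of work)."""
    return sum(1 for base in seq if base in "GC")


def parallel_map(func, items, num_threads: int = 10):
    """Apply func to every item using a pool of worker threads.

    Results come back in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(func, items))


# Each sequence is processed independently, like one loop iteration.
counts = parallel_map(gc_count, ["GGCC", "ATAT", "GATTACA"])
```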
| adgjlsfhk1 wrote:
| Have follow-up benchmarks vs BioJulia been done since 2019? If
| I remember correctly at the time, the result was that BioJulia
| was faster once you consider that it did validation.
| arshajii wrote:
| We haven't done too many comparisons with BioJulia since that
| paper, although we did address the (valid) issues they raised
| such as data validation (i.e. Seq now validates input data by
| default, but this can be optionally disabled). We did compare
| against them in our last paper in a sequence alignment
| benchmark: https://www.nature.com/articles/s41587-021-00985-6
| (check the supplement).
| [deleted]
| winter_squirrel wrote:
| This looks cool, I also love how easy the setup was considering
| lots of niche languages I try sometimes seem to have arcane setup
| steps and dependencies
___________________________________________________________________
(page generated 2021-09-15 23:01 UTC)