hngopher.com

       [HN Gopher] Seq - A programming language for computational genom...
       ___________________________________________________________________
        
       Seq - A programming language for computational genomics and
       bioinformatics
        
       Author : tdido
       Score  : 103 points
       Date   : 2021-09-15 09:54 UTC (13 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Bostonian wrote:
       | The code examples look like Python 2 rather than Python 3. Print
       | does not have parentheses. Why was this decision made?
        
         | haihaibye wrote:
         | They support both print syntaxes, and will deprecate Python 2
         | style soon.
         | 
         | https://github.com/seq-lang/seq/issues/223
        
       | car wrote:
       | Looks great, will definitely give this a try since it does
       | sequence manipulations that I otherwise have to write myself.
       | 
       | Will this be available via conda? And how would seq integrate
       | with Snakemake, since that is also based on Python?
        
         | tdido wrote:
         | Seems like there's a conda package in the works:
         | https://github.com/bioconda/bioconda-recipes/pull/29660
        
         | dunefox wrote:
         | BioJulia might be the better choice since it validates data and
         | seq doesn't: https://biojulia.net/post/seq-lang/
        
       | chmaynard wrote:
       | I'm wondering if Seq can also serve as a general-purpose
       | replacement for Python whenever a fast executable is needed.
        
         | arshajii wrote:
         | (I'm one of the developers on Seq.) We've actually been working
         | mostly on closing the gap with Python for the last year or so.
         | Seq can be useful for plain Python programs as well -- I give a
         | bit more context in my comment above.
        
         | dunefox wrote:
         | It's a domain specific language for bioinformatics. So, most
         | likely not.
        
       | gandalfgeek wrote:
       | Quick explainer video: https://youtu.be/5bk4Wc5Op2M
        
       | clusterhacks wrote:
       | I am a CS person who works with bioinformaticians every day as
       | part of my job.
       | 
       | I really like that Seq seems to have some parallelization
       | ability built in. I spend no small amount of time in my
       | day job doing that manually in R with RcppParallel for loops that
       | are totally independent across each iteration.
       | 
       | Bioinformaticians are often educated to use a specific
       | programming language and environment. They aren't usually looking
       | to try other languages. For example, I support our bioinformatics
       | group and they are basically 100% R and RStudio users. We have a
       | single user of Python and that user is doing "typical" tensorflow
       | stuff with images.
       | 
       | I've noticed this same bias towards a single language for some
       | other academic niches. Like SAS or Stata camps in public health
       | or psychology - I think of these languages as basically the same,
       | but for non-CS folks the perception seems to be more like English
       | vs Russian.
       | 
       | Even more complicated, researchers may be extremely committed to
       | a specific library in a language and suspicious of languages that
       | don't have their favorite library available.
       | 
       | Any shift to new tooling for these highly-committed users will
       | almost certainly require _large_ and obvious benefits to gain
       | traction.
        
         | travisgriggs wrote:
         | So basically, the same thing that kept (keeps?) Visual Basic in
         | use for so long.
         | 
         | My son works in polysci analytics and I see the same thing you
         | describe. A group will pick a tool and flog all problems with
         | it. Change rarely occurs. He was in the Stata camp at one
         | university, the TidyVerse at MIT.
         | 
         | It's very weird for me, I develop and maintain a piece of
         | software that has 3 OSes, and 5 languages to wrestle with
         | as well as multiple "tool" technologies like Ansible/MQTT, etc.
         | so I'm very much in a polyglot-best-tool-for-the-job
         | environment. Observationally from a casual POV, I see pros/cons
         | both ways.
        
         | psychometry wrote:
         | Scientists like using R instead of Python because the language
         | lets them get set up and coding quickly with RStudio. More
         | importantly, the language, tooling, and ecosystem are very
         | forgiving when it comes to code quality and style. There is
         | good R code out there, but the R community generally lacks the
         | wide acceptance of good coding practices you see with Python
         | users: unit tests, sane dependency management, type hints,
         | documentation, safe namespacing, etc.
         | 
         | It's really saying something when scientists think writing
         | Python code is a pain, because Python's a pretty forgiving
         | language, too.
        
         | dr_kiszonka wrote:
         | Very interesting! I noticed a similar phenomenon in the GIS
         | space. All of my colleagues with formal training in GIS use
         | ArcGIS and its Python API, but those without such background
         | gravitate towards FLOSS solutions.
         | 
         | I am aware of only one case where a community migrated to other
         | software. Many economists I know switched from Stata to R. Some
         | of them later moved on to Python.
        
       | encode wrote:
       | Also see this comparison between Julia's BioSequences and Seq by
       | Jakob Nissen and Ben Ward: https://biojulia.net/post/seq-lang/
        
         | dunefox wrote:
         | This shows imo that BioJulia is better, precisely because it
         | validates data and is a broader programming language invented
         | for science, not a DSL that optimises for speed over all else.
         | Besides, the new version of BioJulia seems to perform even
         | better than seq.
        
         | dgb23 wrote:
         | An interesting takeaway:
         | 
         | > So it appears the primary reason BioJulia code is slower than
         | Seq code in these three benchmarks is that BioSequences.jl is
         | doing important work for you that Seq is not doing. As
         | scientists, we hope you value tools that spend the time and
         | effort to validate inputs given to it rather than fail
         | silently.
         | 
         | Reminds me of the myriads of Excel catastrophes.
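
       To make the validation trade-off quoted above concrete, here is a
       minimal Python sketch of validating input up front. It is not taken
       from BioSequences.jl or Seq; the `validate_dna` helper and the
       five-letter alphabet are assumptions chosen for illustration.

```python
# Assumed alphabet for this sketch; real tools may accept more IUPAC codes.
DNA_ALPHABET = frozenset("ACGTN")


def validate_dna(seq: str) -> str:
    """Return the upper-cased sequence, or raise ValueError on an unexpected symbol."""
    seq = seq.upper()
    for pos, base in enumerate(seq):
        if base not in DNA_ALPHABET:
            # Fail loudly instead of silently carrying bad data forward.
            raise ValueError(f"invalid base {base!r} at position {pos}")
    return seq
```

       The per-character check costs time on every record, which is the
       kind of work the quoted passage says a benchmark can hide.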
        
       | tdido wrote:
       | See also:
       | 
       | https://dl.acm.org/doi/pdf/10.1145/3360551
       | 
       | https://www.nature.com/articles/s41587-021-00985-6 (paywalled)
        
       | fwip wrote:
       | It's an impressive project, but I'm not sure the niche is big
       | enough. It's certainly come a long way since the last time I
       | looked at it!
       | 
       | My biggest concern is that Seq sucks users into a sort of local
       | maximum. While piping syntax is nice, and the built-in routines
       | are handy, it's a lot less flexible than a "mainstream"
       | programming language, simply because of the smaller community and
       | relative paucity of libraries. BioPython[1] has been around a
       | long long time, and I think a lot of potential users of Seq would
       | be better suited by using a regular bioinformatics library in the
       | language they know best.
       | 
       | e.g: The example of reading Fasta files in Seq:
       |     # iterate over everything
       |     for r in FASTA('genome.fa'):
       |         print r.name
       |         print r.seq
       | 
       | versus BioPython:
       |     from Bio import SeqIO
       |     for r in SeqIO.parse("genome.fa", "fasta"):
       |         print(r.id)
       |         print(r.seq)
       | 
       | It might be pretty useful as a teaching tool, but I'm skeptical
       | of its long-term benefit to professionals. I'm not sure the
       | ecosystem of Seq users will be large enough, y'know? Again, it's
       | pretty impressive work, and it's come a long way. I wish the devs
       | all the best. :)
       | 
       | 1. https://biopython.org/
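
       For readers without either library installed, roughly the same
       loop can be sketched in plain Python. This is a simplified reader
       written for illustration, not Seq's or Biopython's implementation;
       `read_fasta` and the in-memory demo data are hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first whitespace-delimited token as the record name.
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Demo input held in memory so the sketch runs without a genome.fa file.
demo = io.StringIO(">chr1 test record\nACGT\nTTGA\n>chr2\nGGCC\n")
records = list(read_fasta(demo))
for name, seq in records:
    print(name)
    print(seq)
```

       It skips everything a real parser handles (validation, compressed
       input, malformed headers), which is part of the argument for
       using an established library.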
        
       | kasperset wrote:
       | I like this idea. However to me it is similar to using a la carte
       | tools/programs along with bash script or DSL such as Nextflow.
       | More often these stand-alone programs are already written in
       | compiled languages. I am sure Seq will allow building customized
       | programs as compared to scripting or gluing programs.
        
       | totalperspectiv wrote:
       | It's odd that they didn't include Nim in the benchmarks in their
       | paper: https://dl.acm.org/doi/pdf/10.1145/3360551
        
         | jpxw wrote:
         | I know nothing about Nim or genomics. Why is it odd that they
         | didn't include Nim?
        
           | pietroppeter wrote:
           | Nim has had some success in genomics mainly thanks to the
           | work of https://github.com/brentp
           | 
           | Nim can be sold as "a strongly-typed and statically-
           | compiled high-performance Pythonic language", like Seq (although
           | it is more than that and does not actually have as a goal to
           | be Pythonic, see https://nim-lang.org/ or
           | https://github.com/Araq/nimconf2021/blob/main/zennim.rst).
           | 
           | Still, given the small size of Nim community and even smaller
           | size of the genomics nim subcommunity, I would say it is not
           | that odd that it is not included in the benchmark. The existing
           | nim genomics library might not even cover the functionalities
           | required by the benchmark.
        
             | lf-non wrote:
             | Nim is not really 'pythonic'. It does have some superficial
             | similarity with Python (being whitespace sensitive) but it
             | begins to diverge pretty soon. This is not really a
             | criticism of Nim. I quite like many of the choices in Nim.
             | 
             | Seq claims that the vast majority of python programs would work
             | as is. I have not validated that claim, but Nim can
             | absolutely not make that claim. Any python library would
             | require substantial porting effort to be translated to nim.
        
             | Zababa wrote:
             | Calling Nim Python is like calling OCaml or Scala Python,
             | it's not really true. The main reason people use Python is
             | because it is Python, not because of an extractable list of
             | things.
        
       | bscphil wrote:
       | > Seq is a Python-compatible language, and the vast majority of
       | Python programs should work without any modifications
       | 
       | > Seq is able to outperform Python code by up to 160x.
       | 
       | So ... a reimplementation of Python that can outperform cpython
       | by over 100 times? I know _literally nothing_ about this project,
       | but I have to say that rings pretty false for me. Hell, even PyPy
       | has trouble with many applications. (Plus they're claiming to
       | outperform "equivalent" C code by 2x.)
       | 
       | Even if the performance claims _are_ overblown, it's always nice
       | to see new work on compiled languages with easy-to-read syntax.
       | It's hard to beat Python for an education / prototyping language,
       | so I will definitely be giving this a look.
        
         | drocer88 wrote:
         | Look at the link: https://github.com/seq-lang/seq It says 96%
         | of the code is C++ in the "Languages" box on the right. C (and
         | C++ and Rust) outperform Python in benchmarks and certain
         | optimized C code can do 160x over very naive Python. So this is
         | very possible, though the routines tested are probably
         | cherry-picked for bragging rights.
        
         | hoseja wrote:
         | Probably, it can outperform generic python _specifically for
         | genomics payloads_, versus python code/C code.
        
         | snicker7 wrote:
         | Most newer languages will give you multiple orders of magnitude
         | better performance than python.
         | 
         | Python's main advantage was that it was easier than some of its
         | competitors (C++/Java). But that is no longer the case with
         | modern languages (Nim/Crystal/Julia/JavaScript) being both
         | faster and comparably as easy (or easier).
         | 
         | It is now coasting off its momentum, mostly due to the vast
         | amount of (usually poorly designed) open source libraries. That
         | and Jupyter.
        
         | amelius wrote:
         | It's probably in the same sense that Numpy is much faster than
         | doing matrix operations with pure Python arrays and Python for-
         | loops.
        
         | arc-in-space wrote:
         | > We show that many important and widely-used NGS algorithms
         | can be made up to 160x faster than their Python counterparts as
         | well as 2x faster than the existing hand-optimized C++
         | implementations
         | 
         | It seems it's better to think of this particular claim as "we
         | made a C++ algorithm that is 2x faster than the previous SotA
         | C++ algorithm" (with the help of a heavily optimized DSL).
        
         | aldanor wrote:
         | I also know literally nothing about this particular project,
         | but why not? If you support a small restricted subset of Python
         | it's completely doable under _certain conditions_ for _specific
         | types of programs_. E.g., Numba can easily outperform Python
         | 100-1000x in numerical applications (done it myself multiple
         | times), simply because it jit-compiles the code by first
         | translating it to LLVM IR.
        
           | mhenders wrote:
           | (minor contributor) I've been following the project for a
           | while and am pleasantly surprised by the ability to manually
           | convert Python programs to Seq without needing to make too
           | many changes. Note, most of my experimentation has been with
           | smallish programs I've written. I like that I can still think
           | "Pythonically" and compose mostly correct Seq code using
           | familiar idioms, e.g. list/set/dict comprehensions. The
           | standard library is very readable and a source for "from
           | import" type functionality. Some of the other features I've
           | come to appreciate: pipeline operator |>, JIT compile or
           | create an executable (seqc run, seqc build), match
           | statements, and strong typing.
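
           The pipeline style mentioned above can be approximated in plain
           Python by folding a value through a list of functions. This is
           only a rough analog of chaining stages, not how Seq implements
           its pipeline operator; `pipe`, `revcomp`, and `gc_fraction` are
           hypothetical helpers.

```python
from functools import reduce

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def pipe(value, *stages):
    """Feed `value` through each stage in order, left to right."""
    return reduce(lambda acc, stage: stage(acc), stages, value)


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# Reads left to right, like a chain of pipeline stages.
result = pipe("AACG", revcomp, gc_fraction)
```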
        
           | bscphil wrote:
           | > If you support a small restricted subset of Python
           | 
           | That's why I quoted their claim that the "vast majority" of
           | Python programs run _unmodified_. Even PyPy barely achieves
           | that. To really get 100x performance over Python (and even
           | supposedly beat C) with a compiler that works on most
           | unmodified Python code would be an _extraordinary_
           | achievement.
        
             | dunefox wrote:
             | That seems to misrepresent the original points: it can run
             | the vast majority of python programs unmodified AND in some
             | cases outperform Python - not at the same time.
        
       | dekhn wrote:
       | Typically, any high performance (low latency or high throughput)
       | genomics/bioinformatics application is not going to be written in
       | plain Python, except possibly for prototyping. Instead, nearly
       | all codes today are written in C++ or Java, with some sort of
       | command and control in Python or a DAG-based workflow scheduler.
       | 
       | I don't expect the community will adopt other languages at a
       | large scale. My hope, though, is that more of these algorithms
       | move to real distributed processing systems like Spark, to take
       | advantage of all the great ideas in systems like that. But
       | genomics will continue to trail the leading edge by about 20
       | years for the foreseeable future.
        
         | east2west wrote:
         | I recall that the group that created Spark had a bioinformatics
         | project on Spark but I don't know what happened to it. All I
         | could find now is a paper[1] hosted by databricks.
         | 
         | [1]https://databricks.com/wp-
         | content/uploads/2018/08/SSE15-40-D...
        
         | adgjlsfhk1 wrote:
         | IMO, spark isn't the way forward. The typical pattern with it
         | is it lets you scale up to 100 cores really easily which is
         | almost enough to compete with a good single threaded
         | implementation in a fast language.
        
           | dekhn wrote:
           | 100 cores? I forgot how to count that low.
           | 
           | The workflows I deal with generally involve moving hundreds
           | of terabytes of storage into memory, processing it, and
           | writing it out. Single machines (even beefy ones) tend to hit
           | their limits (networking, max RAM, cache size, TLB, etc).
           | 
           | Maybe there's another tool better than spark, I don't know,
           | the important thing is that spark is the most ubiquitous.
        
       | fuzzythinker wrote:
       | Used it for coding Coursera/Stepik's Bioinformatics course [1]
       | when it was first announced 2 years ago.
       | 
       | Not claiming it as any sort of reference, but you can see how it
       | [2] may be used to solve some basic genome sequencing.
       | 
       | [1] https://www.coursera.org/specializations/bioinformatics
       | 
       | [2] https://github.com/fuzzthink/seq-genomics
        
       | f6v wrote:
       | > Think of Seq as a strongly-typed and statically-compiled
       | Python: all the bells and whistles of Python, boosted with a
       | strong type system, without any performance overhead.
       | 
       | A pitch most people doing applied bioinformatics won't
       | understand/appreciate.
        
       | jack_riminton wrote:
       | How do you pronounce Seq?
        
         | adgjlsfhk1 wrote:
         | Short for sequence.
        
           | jack_riminton wrote:
           | So is it pronounced sequence or like 'seek'?
        
             | da39a3ee wrote:
             | I have high confidence it's pronounced "seek".
        
       | arshajii wrote:
       | Hi everyone, I'm one of the developers on the Seq project -- I
       | was delighted to see it posted here! We started this project with
       | a focus on bioinformatics, but since then we've added a lot of
       | language features/libraries that have closed the gap with Python
       | by a decent margin, and Seq today can be useful in other areas or
       | even for general Python programs (although there are still
       | limitations of course). We're in the process of creating an
       | extensible / plugin-able Python compiler based on Seq that allows
       | for other domain-extensions. The upcoming release also has some
       | neat features like OpenMP integration (e.g. "@par(num_threads=10)
       | for i in range(N): ..." will run the loop with 10 threads). Happy
       | to answer any questions!
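
       For comparison, the closest standard-library shape of a
       10-thread loop in plain Python looks something like the sketch
       below. It is not Seq's OpenMP integration; `work` and `N` are
       placeholders, and in CPython the global interpreter lock means
       threads like these generally do not speed up CPU-bound loop
       bodies.

```python
from concurrent.futures import ThreadPoolExecutor


def work(i: int) -> int:
    """Placeholder loop body; each iteration is independent of the others."""
    return i * i


N = 100

# Run the iterations on a pool of 10 worker threads.
# pool.map returns results in input order, regardless of completion order.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(work, range(N)))
```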
        
         | adgjlsfhk1 wrote:
         | Have follow-up benchmarks vs BioJulia been done since 2019? If
         | I remember correctly at the time, the result was that BioJulia
         | was faster once you consider that it did validation.
        
           | arshajii wrote:
           | We haven't done too many comparisons with BioJulia since that
           | paper, although we did address the (valid) issues they raised
           | such as data validation (i.e. Seq now validates input data by
           | default, but this can be optionally disabled). We did compare
           | against them in our last paper in a sequence alignment
           | benchmark: https://www.nature.com/articles/s41587-021-00985-6
           | (check the supplement).
        
       | winter_squirrel wrote:
       | This looks cool, I also love how easy the setup was considering
       | lots of niche languages I try sometimes seem to have arcane setup
       | steps and dependencies
        
       ___________________________________________________________________
       (page generated 2021-09-15 23:01 UTC)