[HN Gopher] Show HN: Excel to Python Compiler
       ___________________________________________________________________
        
       Show HN: Excel to Python Compiler
        
       We (me and @aarondia) built a tool to help you turn psuedo-software
       Excel files into real-software Python. Ideally, Pyoneer helps you
       automate your manual Excel processes. You can try it today here:
       https://pyoneer.ai.   _How it works:_  1. You upload an Excel file
       2. We statically parse the Excel file and build a dependency graph
       of all the cells, tables, formulas, and pivots.  3. We do a graph
       traversal, and translate nodes as we hit them. We use OpenAI APIs
       to translate formulas. There's a bunch of extra work here --
       because even with the best prompt engineering a fella like me can
       do, OpenAI sucks at translating formulas (primarily because it
       doesn't know what datatypes its dealing with). We augment this
       translation with a mapping from ranges to variable names and types,
       which in our experience can improve the percentage of correctly
       translatable formulas by about 5x.  4. We generate test cases for
       our translations as well, to make sure the Python process matches
       your Excel process.  5. We give you back a Jupyter notebook that
       contains the code we generated.  If there are pieces of the Excel
       we can't translate successfully (complex formulas, or pivot tables
       currently), then we leave them as a TODO in the code. This makes it
       easy for you to hop in and continue finishing the script.   _Who is
       this for:_  Developers who know Python, primarily! Pyoneer might be
       useful if:  1. You've got an Excel file you're looking to move to
       Python (usually for speed, size, or maintenance reasons).  2.
       There's enough logic contained in the notebook that it's going to
       be a hassle for you to just rewrite it from scratch.  3. Or you
       don't know the logic that is in the Excel workbook well since you
       didn't write it in the first place :)  Post translation, even if
       Pyoneer doesn't nail it perfectly or translate all the formulas,
       you'll be able to pop into the notebook and continue cleaning up
       the TODOs / finish writing the formulas.   _What the Alpha launch
       supports:_  Launched early! Currently we're focused on supporting:
       1. Any number of sheets, with any reference structure between them.
       2. Cells that translate as variables directly. We'll translate the
       formulas to Python code that has the same result, or else we'll
       generate a TODO letting you know we failed translating this cell.
       3. Tables that translate as Pandas dataframes. We support at most
       one table per sheet, at the tables must be contigious. If the
       formulas in a column are consistent, then we will try and translate
       this as a single pandas statement.  We do not support: pivot tables
       or complex formulas. When we fail to translate these, we generate
       TODO statements. We also don't support graphs or macros - and you
       won't see these reflected in the output at all currently.   _Why we
       built this:_  We did YCS20 and built an open source tool called
       Mito(https://trymito.io). It's been a good journey since then -
       we've scaled revenue and to over 2k Github stars
       (https://github.com/mito-ds/mito). But fundamentally, Mito is a
       tool that's useful for Excel users who wanted to start writing
       Python code more effectively.  We wanted to take another stab at
       the Excel -> Python pain point that was more developer focused -
       that helped developers that have to translate Excel files into
       Python do this much more quickly. Hence, Pyoneer!  I'll be in the
       comments today if you've got feedback, criticism, questions, or
       comments.
        
       Author : narush
       Score  : 51 points
       Date   : 2024-05-23 17:40 UTC (5 hours ago)
        
 (HTM) web link (pyoneer.ai)
 (TXT) w3m dump (pyoneer.ai)
        
       | trevzercap wrote:
       | Very cool, what are your plans for working with external data
       | sources via power query?
        
         | narush wrote:
         | Totally on the roadmap, but not sure when yet!
         | 
         | The problem is data gets into these mega-excels through all
         | sorts of funky routes... and I really do mean funky :)
         | 
         | 1. PowerQuery: this is defined statically in the notebook so is
         | detectable by Pyoneer. But I don't know a ton about the
         | integration in Python here. I imagine this is doable.
         | 
         | 2. Manual data entry: Pyoneer can't detect this from a static
         | Excel file, really - what's the difference between the static
         | Excel sheet and data updates ever time? Oftentimes, users with
         | a lot of manual data entry to "automate this in Python" by
         | turning an Excel file into like a form. Generating a proper web
         | app out of the Excel file would be pretty sweet!
         | 
         | 3. Database output copied - aka, copy in a table. This one is
         | sometimes pretty crazy - I've seen Excel workbooks that have
         | SQL queries just copied and pasted into a random cell in the
         | notebook, so you can copy that and run it on some archaic SQL
         | server. And then copy the output back in...
         | 
         | 4. Macros: runs an API call, or an SQL query, or pulls (and
         | then formats) data from another Excel sheet. Then put it in the
         | right place. This then requires translating Macros - which are
         | a whole programming language of their own. This is actually
         | pretty high-priority for us right now, based on early feedback
         | from developers who are in the thick of it with big Excel
         | files.
         | 
         | 6. Custom plugins. Big finance shops build/buy plugins that
         | pull in data all the time! We haven't really started
         | investigating how to handle these.
         | 
         | 5. Other workbooks: at large banks, there's an additional
         | dependency graph of workbooks that rely on eachother across the
         | org. It's epic. There's a single workbook that defines all
         | market holidays, that's used for all excel files that do
         | performance reporting. And then these performance reports feed
         | into other Excel's (by way of direct references, but also by
         | way of copy and pasting, but also by way of
         | uploading/downloading through a database). Support multiple
         | Excel files at once is something we'll have to tackle
         | eventually!
         | 
         | So... there's a lot to do here. We're really early - so we're
         | focused on two primary things right now:
         | 
         | 1. Solving the most pressing pain points first. Hence the early
         | launch so we can talk to more folks and prioritize better. I've
         | got a reasonable idea since I've done so much of this work
         | myself, but every finance shop does things different...
         | 
         | 2. Leaving good TODOs when we can't translate something.
         | Currently, we can't translate pivot tables or complex formulas
         | -- but we generate TODOs for these so you can go back and fill
         | them in with the Python skills you have (and maybe the help of
         | ChatGPT).
         | 
         | We're aiming to just give you a Python script. So if we don't
         | translate the data pull how you want... you can just edit the
         | notebook :)
        
       | kingkongjaffa wrote:
       | What customer discovery have you done so far?
       | 
       | > turn psuedo-software Excel files into real-software Python
       | 
       | I'm curious how many people actually have this problem. It
       | strikes me that the corporate environments that are building big
       | hairy excel files probably have locked down IT where running
       | python is a difficult thing to do.
       | 
       | In the environments where you can get some kind of python
       | running, it's probably likely that developers are looped in as
       | part of an efficiency project, and doing some kind of voice of
       | the customer / requirements capture and starting to build from
       | scratch more or less.
       | 
       | There's a vanishing window for stuff like this, if you're a
       | Microsoft shop like 99% of the corporate world I think you are
       | turning those excel files into power apps and powerBI dashboards,
       | before you are hiring python devs.
        
         | kingkongjaffa wrote:
         | Additional thoughts:
         | 
         | How are you modeling the input data? Are you using dataclasses
         | and type annotations? something like
         | https://docs.pydantic.dev/latest/ ?
         | 
         | How/Where is the data stored? Not inside the notebook, right?
        
           | narush wrote:
           | Currently:
           | 
           | 1. Data remains stored in the excel file. The generated
           | script pulls the raw data directly from the notebook - but
           | it's a single read_xlsx call. So if you want to switch it out
           | for an API call, db read, whatever - it's easy to do so.
           | 
           | 2. We model data as primitive Python data types, or, if it's
           | a table, as a pandas dataframe.
           | 
           | Currently, we detect at most one table per sheet, and it's
           | gotta be contiguous. These are pretty huge limitations we'll
           | be relaxing soon -- but we wanted to get something out as
           | soon as it would have been useful to one person -- and in
           | it's current state, this would have helped me with some of my
           | larger Excel automation projects :)
        
         | Nicholas_C wrote:
         | From their website it sounds like they know there is a market
         | for this because they've been doing it manually for a while:
         | 
         | "Our founding team spent the past decade transitioning Excel
         | files to Python - from startups to insurance companies to bulge
         | bracket banks."
        
           | narush wrote:
           | Yeah, we built this because we wished we had it! I've spent
           | literally thousands of hours reimplementing Excel workbooks
           | in Python as support for our previous shot at this problem -
           | which was a spreadsheet that generates Python code as you
           | edit it.
        
         | robertlagrant wrote:
         | > if you're a Microsoft shop like 99% of the corporate world I
         | think you are turning those excel files into power apps and
         | powerBI dashboards, before you are hiring python devs.
         | 
         | This is a good incremental decision, but once everyone's done
         | it, perhaps an edge can be found by doing it properly. As much
         | as IT folk love administrating Microsoft products, the products
         | are terrible for the users.
        
         | narush wrote:
         | > What customer discovery have you done so far?
         | 
         | Me and my two cofounders spend the past 4 years working on Mito
         | (https://trymito.io) -- where our customers are primarily large
         | finance shops (including some bulge bracket banks you've heard
         | of) that has a really concrete goal of getting users out of
         | Excel and into Python. It's not every finance shop, but a quite
         | a few are trying to make this transition. This usually means:
         | multi-day Python trainings, a Python support team, a few
         | developers who semi-full-time job is helping transition
         | existing Excel processes to Python.
         | 
         | We built Mito to be a tool for the Excel-first users - we tried
         | to make it easier for them to use their existing spreadsheet
         | skills to write Python. But in working with the developers that
         | support these new Python users, it became clear to us that
         | there's a big pain point around:
         | 
         | 1. I'm a dev who was given a big, old Excel file
         | 
         | 2. It has a lot of business logic in it, understood by the
         | person who made it, but not by me - who is tasked with turning
         | it into real software
         | 
         | 3. I have to spend 100s of hours: trying to understand the
         | file, faithfully replicating the logic, and testing for
         | consistency - to convert this to an Excel process.
         | 
         | I personally have been this developer in quite a few cases -
         | just as support for Mito and helping these Excel users trying
         | to transition to Python. Some Excel files literally take 300+
         | hours to "rebuild from scratch" in Python. It's often very
         | engaging work, but brutally slow - so we're trying to automate
         | as much as we can with Pyoneer.
         | 
         | > There's a vanishing window for stuff like this, if you're a
         | Microsoft shop like 99% of the corporate world I think you are
         | turning those excel files into power apps and powerBI
         | dashboards, before you are hiring python devs.
         | 
         | I think this is a really fair point! We're not sure exactly
         | what a reasonable business model really looks like, long-term.
         | Right now, we're really focused on finding the developers for-
         | which this is a big pain point, and seeing what we need to
         | prioritize to make their lives better. I'm one of those
         | developers...
        
           | kingkongjaffa wrote:
           | Hey, thanks for the reply!
           | 
           | My next question would be how did Mito go?
           | 
           | What MMR did you get to?
           | 
           | How strong was the PMF?, 4 years seems like a long time to
           | test this product, and I'm not sure if the overall market for
           | this is too small. Is it the trap of dogfooding (building the
           | thing you wish you had) without sizing the market?
           | 
           | Mito site says: Trusted by dozens of fortune 500 companies,
           | how penetrated is that really? Is it one dev in each company
           | on the free tier or entire departments/teams using this on
           | the $150/month/user plan?
           | 
           | I built something similar (excel / python space ) but it was
           | really just one feature as part of a larger platform, not
           | something I would build a company around.
        
             | narush wrote:
             | Sure thing - thanks for the good thoughts!
             | 
             | We're still working on Mito - it's not a retired product by
             | any means. Pyoneer is just another stab at the same problem
             | for a different user group.
             | 
             | MMR-wise, we scaled to profitability. PMF-wise, we have not
             | reached this. We have large customers who make up the bulk
             | of our revenue who love the product, and use it quite
             | effectively as the basis for their entire Python program,
             | but, transparently, scaling is hard!
             | 
             | > Is it the trap of dogfooding (building the thing you wish
             | you had) without sizing the market?
             | 
             | Very possibly. But I think this is a much bigger pain point
             | at large orgs with legacy processes than you realize,
             | though. Every large bank has an entire development teams
             | that are tasked with transitioning legacy processes out of
             | spreadsheets. We're aiming to improve the efficiency of
             | these developers dramatically.
             | 
             | For some of the spreadsheets I've personally automated, I
             | think this would take 300 hours of work and make it like
             | 5...
        
         | Bostonian wrote:
         | 'I'm curious how many people actually have this problem. It
         | strikes me that the corporate environments that are building
         | big hairy excel files probably have locked down IT where
         | running python is a difficult thing to do.'
         | 
         | FWIW I have worked at 3 financial companies (including my
         | current job) where Excel is used heavily and where "desk
         | quants" are allowed to use Python.
        
       | hk1337 wrote:
       | This is really awesome.
       | 
       | > In addition, Content found on or through this Service are the
       | property of Saga Inc. or used with permission. You may not
       | distribute, modify, transmit, reuse, download, repost, copy, or
       | use said Content, whether in whole or in part, for commercial
       | purposes or for personal gain, without express advance written
       | permission from us.
       | 
       | Is the "Content" the results of what is uploaded?
        
         | narush wrote:
         | You 100% own all Python code that you download from Pyoneer.
         | Sorry for the lack of clarity here!
         | 
         | You're welcome to: 1. Edit it to fix up places where it can't
         | translate fully. 2. Send it to your colleagues and tell them
         | you wrote it (although... for your own personal morals... maybe
         | don't :) ) 3. Upload it to your companies Github 4. Whatever
         | the hell else you want
         | 
         | To be very clear: this is your code!
         | 
         | The code we generate currently has no dependencies on anything
         | other than pandas, numpy, the Python core library, and the
         | Excel file you uploaded in the first place. This might change
         | as we try and support more Excel features (and so need to do
         | some pivot table mocking), but we're not aiming to lock folks
         | in!
         | 
         | P.S. If you do anything particularly interesting with the code
         | you generate... tell me about it. nate @ sagacollab.com. I'd
         | love to hear P.P.S. I'll update the language on this ASAP.
        
       | rfergie wrote:
       | https://pypi.org/project/formulas/ might help with this without
       | needing the OpenAI part
        
         | narush wrote:
         | Thanks for the link! We looked into existing libraries for
         | Excel formula execution - this library as an awesome example!
         | 
         | We considered using this, but to copy from our MVP spec:
         | 
         | The easiest thing to do is to replicate Excel's execution
         | engine -- you can see someone who has done this
         | [here](https://pypi.org/project/xlcalculator/). But this is
         | _just evaluating Excel formulas is not what users want when
         | transitioning a process to Excel - they want to be able to
         | ditch Excel (mostly).
         | 
         | The next easiest thing would be to transpile Excel formulas to
         | the following format:                   # This is not a useful
         | or usable artifact; you're still trapped in Excel         # but
         | things are just worse now, since it's not on a 2d grid
         | A1 = "Prices"         A2 = 1         A3 = 2         B1 = "With
         | Tax"         B2 = SUM(A1, 10) * 1.3         B3 = SUM(A2, 10) *
         | 1.3
         | 
         | But this is ultimately is an awful solution for the user:
         | 
         | 1. It's 100% impossible to read. A large Excel file often has
         | 100k+ formulas (many of them with shared structures). This is
         | 100k+ lines of code...
         | 
         | 2. It's impossible to maintain. Yeah, since it lacks all
         | semantic structure, there's no f**_ way you're going to test it
         | or modify it.
         | 
         | In general: *we're trying to generate a Python script that
         | appears to be written by an expert developer. To do this, you
         | have to be willing to ditch the Excel formulas / execution
         | engine.
        
       | bko wrote:
       | Why would you get an LLM to transpile the formula for you? There
       | are already libraries that attempt to do this. Wouldn't your time
       | be better served in expanding these libraries to overcome
       | deficiencies? I don't get why you would throw away all the work
       | done in this space and just offload it to an LLM. It's a
       | technical problem. Imagine if ditched all transpilers and gave it
       | to an LLM and said "good luck!"
       | 
       | https://formulas.readthedocs.io/en/stable/
        
         | flashgordon wrote:
         | This jumped out for me too. Here seems like a huge wasted
         | opportunity.
        
           | narush wrote:
           | Check my comment to OP on this thread - hopefully this is
           | answers why we don't do this!
           | 
           | I will say: we are considering some more approaches from
           | traditional compilation / transpilation. I think these are
           | very compatible!
        
         | narush wrote:
         | Copying and pasting from a comment below -- we considered this
         | heavily during our MVP, but in practice "moving an Excel
         | process to Python" does not just mean executing an Excel file
         | through Python. This pretty much doesn't replace your Excel
         | dependence at all.
         | 
         | Consider the following (pretty easy) translation of a simple
         | table:                   # This is not a useful or usable
         | artifact; you're still trapped in Excel         # but things
         | are just worse now, since it's not on a 2d grid         A1 =
         | "Prices"         A2 = 1         A3 = 2         B1 = "With Tax"
         | B2 = SUM(A1, 10) * 1.3         B3 = SUM(A2, 10) * 1.3
         | 
         | But this is ultimately is an awful solution for the user: 1.
         | It's 100% impossible to read. A large Excel file often has
         | 100k+ formulas (many of them with shared structures). This is
         | 100k+ lines of code... 2. It's impossible to maintain. Yeah,
         | since it lacks all semantic structure, there's no f* way you're
         | going to test it or modify it.
         | 
         | To make it really concrete: you can't just transpile a SUMIF or
         | VLOOKUP to a Python implementation of SUMIF or VLOOKUP, and you
         | absolutely can't do this on a cell-by-cell basis.
         | 
         | Rather: we're trying to generate a Python script that appears
         | to be written by an expert developer. To do this, you have to
         | be willing to ditch the Excel formulas / execution engine, do
         | more abstract reasoning over the file (like identifying tables
         | / consistent formulas in columns and translating them as pandas
         | dataframes), and translate them without just relying on
         | matching the Excel exactly.
         | 
         | You want something closer to this:                   df =
         | pd.DataFrame({'Prices': [1, 2]})         df['With Tax'] =
         | (df['Prices'] + 10) * 1.3
         | 
         | We want parity of outputs, not parity in how we get there!
        
           | jkaptur wrote:
           | > We want parity of outputs, not parity in how we get there!
           | 
           | To be clear, though, you don't necessarily have parity of
           | outputs.
           | 
           | It seems strange to me to sacrifice correctness for readable
           | output. I would prefer a deterministic strategy that is
           | always correct and sometimes readable. You could do that by
           | generating an intermediate structure A1=... A2=..., then
           | applying heuristics to say "hey, this enormous column of
           | VLOOKUPs is actually a join", and so on. Maybe LLMs could
           | advise on that, but I'm not sure how you'd check their
           | work...
           | 
           | ... Anyway you're the person "in the arena", having actually
           | created something, so well done!
        
             | narush wrote:
             | > To be clear, though, you don't necessarily have parity of
             | outputs.
             | 
             | The cool thing is that the Excel file is both the
             | programatic specification of the process as well as the
             | actual output data you want as well. We can check parity of
             | outputs by comparing the data we create with Python to the
             | data in Excel - in practice, Pyoneer generates test cases
             | for tables that do exactly this, even when we can't
             | translate every formula correctly!
             | 
             | > applying heuristics to say "hey, this enormous column of
             | VLOOKUPs is actually a join", and so on.
             | 
             | We do this deterministically currently. The only non-
             | deterministic aspect is formula translation - where we
             | defer to some LLM. Structurally, everything is
             | deterministic though - and here we really do aim for
             | readability (there's a lot more to do here though).
        
       | flashgordon wrote:
       | So is this a one-way only process (I'd assume it is)? This
       | assumes that the once the python engineers who are "tasked" to
       | productionize it, no more changes are allowed to the sheets or do
       | you allow incremental updates?
        
         | narush wrote:
         | Yep - this is a one-way process! You can think of it like an
         | eject from Excel, in the best case.
         | 
         | The devs we've worked with so far have the goal of replacing
         | the Excel process - inheriting it from the team that runs it
         | manually, and automating it fully in Python. From them on,
         | changes to the process would run through a more traditional
         | software-development lifecycle, as you would be editing code.
         | 
         | For these devs - this is a feature not a bug! In Excel, version
         | control, testing, and review is pretty much non-existent...
         | 
         | Cool username btw...
        
       | LeoPanthera wrote:
       | > We use OpenAI APIs to translate formulas.
       | 
       | It's very uncool that you don't make this _very_ clear on the
       | landing page.
        
         | narush wrote:
         | Thanks for the feedback. I've updated the landing page to
         | prominently display this language - see the how it works
         | section.
        
       | gaze wrote:
       | Sorry "morgtage.xls"? This makes me feel a bit pessimistic about
       | the accuracy of the formulas.
        
         | narush wrote:
         | Woof. Always the words you stare at the most that are wrong...
         | will update that demo video when I get the chance, but might be
         | a bit :)
         | 
         | Good thing Pyoneer generates test cases for the formulas it
         | generates! No need to trust my spelling abilities -- your Excel
         | file is the ultimate source of truth.
        
       | jeffwiederkehr wrote:
       | I stumbled into a career in programming due to originally
       | learning excel in an unrelated field. This project is such a
       | great idea imo and seems like it would be a game changer for
       | anyone that is regularly working between excel and python files
       | for data.
       | 
       | Also my take on the transpilation vs LLM is that you chose the
       | right path. The point of transforming an excel spreadsheet to
       | python is to have better organization and ability to reason about
       | the operations being performed.
        
       ___________________________________________________________________
       (page generated 2024-05-23 23:01 UTC)