[HN Gopher] Show HN: Sketch - AI code-writing assistant that und...
___________________________________________________________________
Show HN: Sketch - AI code-writing assistant that understands data
content
Hey HN! I'm excited to share sketch: a tool to help anyone who
uses python and pandas quickly iterate and get to answers for their
data questions. Sketch installs as a pandas extension that offers
utility functions that operate on natural language prompts. Using
the `ask` interface you can get answers in natural language. Using
the `howto` interface you can get python and pandas code
directly. The primary benefit of this over copilot and chatGPT is
that this adds data-content based context so that the generated
answers are much more accurate and relevant to the data problem at
hand. Check out the demo video[1] and try it out using the colab
notebook (on github)!
[1] https://user-images.githubusercontent.com/916073/212602281-4...
Author : bluecoconut
Score : 176 points
Date : 2023-01-16 13:33 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| tdebroc wrote:
| Looks really nice, but I tried it:
|
|     import sketch
|     import pandas as pd
|
|     data_pd = pd.read_csv("input.csv", sep=';')
|     print(data_pd)
|     print(data_pd.sketch.ask("Is there any PII in this dataset ?"))
|     print(data_pd.sketch.ask("Which columns are integer type?"))
|
| With this input.csv:
|
|     name;age;address;phone
|     Bob;34;106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD;1-541-754-3010
|     Anna;34;694 Short Street, Austin, Texas;001-541-754-3010
|
| And I have no results (and no runtime error either) :-( Here is
| the console output:
|
|        name  age                                         address             phone
|     0   Bob   34  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD    1-541-754-3010
|     1  Anna   34                 694 Short Street, Austin, Texas  001-541-754-3010
|     <IPython.core.display.HTML object>
|     None
|     <IPython.core.display.HTML object>
|     None
|
| Am I missing something? The "ask" interface doesn't seem to
| need external OpenAI credentials, right?
| bluecoconut wrote:
| To get the strings of the results back out, add the kwarg
| `call_display=False` to the functions.
|
| So:
|
| ```
| print(data_pd.sketch.ask("Is there any PII in this dataset ?", call_display=False))
| ```
|
| should work for you.
|
| Right now it assumes by default that it's in an IPython context that
| can display HTML objects.
| tdebroc wrote:
| Ah yes it displayed the string, thanks!
|
| But the result looks wrong with this input:
|
|        age                                         address
|     0   34  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD
|     1   34                 694 Short Street, Austin, Texas
|
| It says:
|
|     No, there is no PII (personally identifiable information) in
|     this dataset. The only columns are index, age, and address,
|     none of which contain any sensitive information.
|
| Sometimes it seems to work with a phone number, though. Here:
|
|        age                                         address             phone
|     0   34  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD    1-541-754-3010
|     1   34                 694 Short Street, Austin, Texas  001-541-754-3010
|
|     Yes, this dataset contains PII (personally identifiable
|     information) such as age, address, and phone number.
|
| I retried:
|
|        pirce                                         address             phone
|     0    123  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD    1-541-754-3010
|     1  43543                 694 Short Street, Austin, Texas  001-541-754-3010
|
|     No, there is no personally identifiable information (PII) in
|     this dataset. The columns contain only generic information such
|     as index, price, address, and phone number. None of these
|     columns contain any information that could be used to identify
|     an individual.
|
| Which is wrong. Is there an explanation?
| ibestvina wrote:
| Great work, and a really interesting application of GPT3. Some
| time ago I developed Datasloth [1] which might be a nice
| complementary feature to Sketch. Ping me if you're interested to
| bounce ideas :)
|
| [1] https://github.com/ibestvina/datasloth
| gcatalfamo wrote:
| Cool project, although the name kinda clashes with the well-known
| https://www.sketch.com/ in the UI/UX design space
| daveguy wrote:
| This is very cool. A useful case for gpt. One question / concern:
| isn't a person's address considered PII? Is the system flexible
| enough to add pre-statements such as "treat an address as PII"?
| harvey9 wrote:
| Related question: is this done on my machine or do I end up
| sending possible pii to a cloud service for evaluation?
| bluecoconut wrote:
| This is sending summary statistics to a cloud machine by
| default (for ease of immediate use).
| https://github.com/approximatelabs/sketch#sketch-currently-u...
|
| You can run using your own OpenAI key by setting 2
| environment variables:
|
|   (1) SKETCH_USE_REMOTE_LAMBDAPROMPT=False
|   (2) OPENAI_API_KEY=YOUR_API_KEY
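
A minimal sketch of the setup described above, done from inside Python rather than the shell. It assumes the variables are read when `sketch` is imported, so they are set first; the key value is a placeholder, and the import is left commented out so the snippet runs without the package installed.

```python
import os

# Assumption: sketch reads these variables at import time, so set them first.
os.environ["SKETCH_USE_REMOTE_LAMBDAPROMPT"] = "False"
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"  # placeholder, not a real key

# import sketch  # uncomment once the package is installed
```

Exporting the same two variables in the shell before starting the notebook server should have the same effect.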
|
| To run entirely locally (using your own GPU and a model like
| Bloom) one would have to add a new prompt type to
| `lambdaprompt` (the package that this depends on), have a
| machine with enough GPU resources, and then add a slight
| modification to sketch.
| adabyron wrote:
| Not sure if this is a business you're building out of this
| or an experiment. For real use for any of my customers, I
| would need to run this entirely locally.
|
| I think it's really awesome though!
|
| Curious what "enough GPU resources" looks like? Would a
| GeForce RTX 40 or 30 series card with 12-24GB of RAM be
| sufficient per user running locally on their machine?
| irthomasthomas wrote:
| This is very cool! I've literally today been noodling with ideas
| to use probabilistic data structures in LLMs.
|
| And TIL you can embed mp4s in a GitHub readme. Is that new?
| sean_the_geek wrote:
| Really cool and helpful. Is there anything similar for R?
| pklee wrote:
| A GPT-3 model can generate the SQL, and you can run it with sqldf
| on top of your data.table. We will be demoing this at one of the
| events shortly. BTW, you could do something similar with other LLMs
| such as GPTJ and GPT NEOX if you have worked with them.
| rafaelmelhem wrote:
| is GPTJ/NEOX good enough to generate code? tried it with SQL
| and it was really disappointing
| jerpint wrote:
| Does using this mean sending all of your potentially private data
| via an api call to openAI?
| abrichr wrote:
| From
| https://github.com/approximatelabs/sketch/blob/main/sketch/p...
| it appears that this library is calling a remote API, which
| obviates the utility of the demonstrated use case.
|
| Upon closer inspection, it looks like
| https://github.com/approximatelabs/sketch interfaces with the
| model via https://github.com/approximatelabs/lambdaprompt,
| which is made by the same organization. This suggests to me
| that the former may be a toy demonstration of the latter.
|
| Interesting how as of the time of writing this, most of the
| comments here (i.e. dozens) are praising this as a legitimate
| use case. Maybe I'm missing something obvious, but it seems
| clear to me that uploading data to a third party to verify
| whether that data contains PII is a non-starter for any serious
| application.
| teaearlgraycold wrote:
| "Does this data contain PII?"
|
| "Yes, and you just shared it all with Microsoft :D"
| jonwinstanley wrote:
| Very cool demo!
|
| Regarding the choice of name, presumably you already know about
| Sketch, the popular image editing software.
|
| I wonder if the image editing guys will in the future incorporate
| AI functionality too? Which might make "Googling" for your
| product difficult for your potential customers?
| Jugglerofworlds wrote:
| There's also a program synthesis project called Sketch, which
| is much closer to the domain of what the user posted:
| https://people.csail.mit.edu/asolar/
| pfd1986 wrote:
| Hi, cool stuff! Which LLM is being used in the background? I may
| have missed that info in the readme. Thanks!
| swyx wrote:
| digging thru the code
| https://github.com/approximatelabs/sketch/blob/9d567ec161015...
|
| this seems to be using their gpt3 framework:
| https://github.com/approximatelabs/lambdaprompt
|
| which uses text-davinci-003 by default
| https://github.com/approximatelabs/lambdaprompt/blob/main/la...
| bluecoconut wrote:
| Thanks!
|
| Right now this is running off of GPT-3 (`text-davinci-003`) and
| via a small code change can run on codex (`code-davinci-002`)
| but the quality only improves a little bit with that change.
|
| That said, this is the first version to show that the interface
| is viable; we are currently working on training our own
| foundation model on a hybrid tokenization of data and word
| tokens. I hope to improve this same toolkit in the future with
| these new models of our own that we are training.
| ethanwillis wrote:
| Well, I'm locked out of my github account right now and don't
| feel like going through all those hoops right now but I wanted to
| point something minor out.
|
| In this line,
| https://github.com/approximatelabs/sketch/blob/9d567ec161015...
|
| I think you can end up marking control characters as "UNKNOWN"
| characters by accident by assuming that dictionary.items() always
| returns items in a consistent order in all contexts/environments.
| This isn't always true.
|
| edit: actually with the way the code is written if you have any
| overlapping ranges at all you'll end up double/triple/etc.
| counting a character into multiple categories.
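
The double-counting concern can be illustrated with a small, self-contained sketch. The category table below is hypothetical and is not taken from sketch's source; it only shows how overlapping ranges inflate counts when every matching category is incremented, and how an explicit priority order with a first-match rule avoids that regardless of dictionary iteration order.

```python
# Hypothetical category table: these ranges are NOT taken from sketch's source.
# "control" and "whitespace" deliberately overlap on code points 9-13 (tab, newline, ...).
CATEGORIES = {
    "control": [(0, 31)],
    "whitespace": [(9, 13), (32, 32)],
    "digit": [(48, 57)],
}


def in_category(code, name):
    """True if the code point falls inside any range of the named category."""
    return any(lo <= code <= hi for lo, hi in CATEGORIES[name])


def count_all_matches(text):
    """Increment every matching category, so overlapping ranges are double-counted."""
    counts = {name: 0 for name in CATEGORIES}
    for ch in text:
        for name in CATEGORIES:
            if in_category(ord(ch), name):
                counts[name] += 1
    return counts


def count_first_match(text, priority):
    """Assign each character to at most one category, in an explicit priority order."""
    counts = {name: 0 for name in priority}
    counts["unknown"] = 0
    for ch in text:
        for name in priority:
            if in_category(ord(ch), name):
                counts[name] += 1
                break
        else:
            # No category matched this character.
            counts["unknown"] += 1
    return counts


sample = "1\t2"
naive = count_all_matches(sample)
fixed = count_first_match(sample, ["whitespace", "control", "digit"])
print(naive)
print(fixed)
```

With the first-match version the counts always sum to the length of the input, which makes the result easier to sanity-check.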
| mmaia wrote:
| Very promising. I believe the uses of OpenAI that will stick in
| the long term are like this, and other tools should be
| experimenting with this kind of integration.
|
| Otherwise, there's room for other solutions, such as airops sidekick
| [1], which uses browser extensions to embed itself in other data
| tools.
|
| 1- https://www.airops.com/
| hgarg wrote:
| I spent a few weeks last year building a text to sql tool using
| codex model to do something like this but for all kinds of data
| sources. We pivoted away to something else for various reasons.
|
| But your approach is much better. Pandas is used a lot. Build a
| tool on top of pandas. This is awesome.
| javierluraschi wrote:
| https://hal9.com is focused on building data apps with LLMs,
| would love to explore integrating and contributing to Sketch. If
| this sounds interesting I'm at javier at hal9.ai
| drcongo wrote:
| I use TabNine [0] for local context aware AI suggestions, and I
| find it spookily good at guessing what I'm half way through
| typing. Sadly they've left the Sublime plugin to rot and it's
| mostly a hindrance in ST4.
|
| [0] https://www.tabnine.com
| ldh0011 wrote:
| So... Microsoft bought 48 or 49% of OpenAI right? Integrating
| this into Excel would make everyone an excel power user.
| bufferoverflow wrote:
| But if it makes a logical mistake, it would take a real power
| user to notice it.
| localhost wrote:
| But wouldn't you need to integrate Python into Excel for this
| to work?
| mmaia wrote:
| A lot of people already use excelformulabot. The impact of
| something integrated into Excel would be pretty big.
| davidbressler wrote:
| It's already integrated into Excel with the add-on.
|
| What else did you have in mind?
| blakeburch wrote:
| This is fantastic and exactly where our team at Shipyard is
| expecting the data space to go. Context aware, AI driven. Great
| work on this!
|
| We were just talking last week about how we should create a
| feature to describe transformations you want in Natural Language
| that get compiled to pandas/SQL. Input data is everything
| associated with the original file/dataframe.
|
| Visual transformation tools are typically limited and non-
| reproducible. If you could switch it around to be code-compiled
| but description-driven, that would open up new possibilities.
|
| I'd love to chat if you're open to it. Email in bio.
| jadbox wrote:
| I'd love something like a standalone SQL IDE where I can ask an
| AI to generate queries or migration scripts.
|
| Sadly to be honest, I don't think I'd pay a subscription for
| such a service. I would prefer to pay a one time tooling fee
| and just run a trained model in the IDE locally.
| rafaelmelhem wrote:
| I did something similar for my own use. Using natural
| language, it makes SQL queries against your .csv or .xlsx files
| (soon I'll add features so you can connect to databases), but it is
| not mature enough to sell as a service. Feel free to reach me at
| info [at] rafaelmelhem . com if you want and I'll send a demo :)
| vorpalhex wrote:
| Yeah the risk of your sql walking off to an AI vendor is not
| worth the time savings.
| swyx wrote:
| This is a great demo, OP.
|
| I'm wondering about the UX of this vs Copilot. is this basically
| just a way to get around the fact that you don't have Copilot
| inside of notebooks? what else am I missing about this
| experience?
| bluecoconut wrote:
| Thanks!
|
| That is definitely a big part of it, getting to use copilot
| style answers without having to install any plugins to the IDE
| (so getting to use this in colab or jupyter notebooks directly
| feels great).
|
| That said, I use both copilot and sketch in my VScode
| notebooks, and find that they have slightly different feelings
| to the iteration loop.
|
| Sketch offers a more "local" data context (pinning the
| text/prompt to the specific dataframe) which increases the
| quality of the suggestions (since more relevant information is
| within the token limit).
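
To make the idea of pinning a prompt to one dataframe's content concrete, here is a rough, standard-library-only sketch. It is not sketch's actual implementation: the function names and prompt layout are illustrative assumptions, and plain lists stand in for dataframe columns. It summarizes each column compactly and prepends that summary to the question, so the data context stays small enough to fit alongside the prompt.

```python
from collections import Counter


def summarize_column(name, values, max_examples=3):
    """Describe one column compactly: dominant type, distinct count, common values."""
    if not values:
        return f"- {name}: empty"
    kinds = Counter(type(v).__name__ for v in values)
    examples = [str(v) for v, _ in Counter(values).most_common(max_examples)]
    return (
        f"- {name}: type={kinds.most_common(1)[0][0]}, "
        f"distinct={len(set(values))}, examples={examples}"
    )


def build_prompt(columns, question):
    """Combine per-column summaries and the user's question into one prompt string."""
    lines = ["Columns of the dataframe:"]
    lines += [summarize_column(name, values) for name, values in columns.items()]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Hypothetical data, loosely based on the example discussed earlier in the thread.
columns = {
    "age": [34, 34],
    "phone": ["1-541-754-3010", "001-541-754-3010"],
}
prompt = build_prompt(columns, "Which columns are integer type?")
print(prompt)
```

Only the summaries, not the full rows, end up in the prompt, which is one plausible reading of "summary statistics" as used elsewhere in the thread.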
| allisdust wrote:
| I don't have any experience with pandas. Can this directly
| connect to a db and run queries there (the video seems to load a csv
| file)?
| harvey9 wrote:
| If you can already write SQL to return a data set then you can
| get that set to pandas with pyodbc.
| [deleted]
| jamal-kumar wrote:
| Damn, this looks pretty useful. I was finding that github copilot
| was really good at reading a CSV file and writing all the imports
| from that into migrations for DB import, but this looks like it
| does these data transformations even more robustly.
|
| Are there any plans on getting this to work outside of the
| python/pandas ecosystem or is it intrinsically tied to that
| environment?
___________________________________________________________________
(page generated 2023-01-16 23:00 UTC)