[HN Gopher] Show HN: Yobulk - Open-source CSV importer powered b...
___________________________________________________________________
Show HN: Yobulk - Open-source CSV importer powered by GPT3
Author : yosai
Score : 177 points
Date : 2023-02-21 14:05 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| cjtechie wrote:
| This is exactly what I was looking for. Will it help me run
| big files and cleanse them?
| yosai wrote:
| YoBulk uses buffer streaming internally, so you can upload a CSV
| that is gigabytes in size. You can try it at your end and let me
| know your feedback.
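|
| A rough sketch of the general streaming approach (illustrative
| only, assuming the csv-parse package and an example file name;
| not necessarily YoBulk's exact code):
|
|   const fs = require('fs');
|   const { parse } = require('csv-parse');
|
|   // Stream the file instead of loading it all into memory,
|   // so multi-GB CSVs can be processed record by record.
|   const parser = fs.createReadStream('billboards.csv')
|     .pipe(parse({ columns: true }));
|
|   parser.on('data', (row) => {
|     // validate / transform each record as it arrives
|   });
|   parser.on('end', () => console.log('done'));
|   parser.on('error', (err) => console.error(err));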
| cjtechie wrote:
| Thank you. Let me try this out
| bcrl wrote:
| This strikes me as an idea as bad as the Xerox document scanner
| that implemented a compression algorithm that changed digits.
| It'll be really fun debugging when something completely
| unexpected gets spit out of the neural network.
| yosai wrote:
| @bcrl we provide our own schema which is understood by our
| validation engine. It's a user option to use GPT or not. GPT
| output always comes with a disclaimer that it might not be
| correct. We will be solving that gradually.
| yosai wrote:
| Hey Everybody,
|
| We are really excited to open source YoBulk today.
|
| YoBulk is an open source CSV importer for any SaaS application -
| It's a free alternative to https://flatfile.com/
|
| Why are we building YoBulk:
|
| In our previous startup, we were receiving CSV files from various
| billboard screen owners every day, following a specific template
| that we defined. Despite the well-defined template, the CSV files
| we received often contained manual errors, which was a challenge
| to fix with the data provider.
|
| We were receiving around 500,000 billboard data updates each day,
| including price changes and creative info data. It was a
| difficult and time-consuming job to clean and format the data to
| fit our database schema and upload it into the database. As a
| result, we wanted to automate the entire CSV importing process.
| In our second startup, we encountered similar challenges when
| cleaning large CSV files with location and timestamp data.
|
| We realised that more than 70% of business data is shared in CSV
| and Excel formats, and only a small percentage use API
| integrations for data exchange. As developers and product
| managers, we have experienced the difficulties of building a
| scalable CSV importer, and we know that many others face the same
| challenges. Our goal is to solve this problem by taking an open
| source, AI-first and developer-centric approach.
|
| Who can use YoBulk:
|
| YoBulk is a highly beneficial tool for a variety of
| professionals, such as Developers, Product Managers, Customer
| Success teams, and Marketers. It simplifies the process of
| onboarding and verifying customer data, making it an
| indispensable asset for those who deal with frequent CSV data
| uploads to a system with a predetermined schema or template.
|
| This tool is particularly valuable for updating sales CRM or
| product catalog data, and it effectively solves the initial
| challenge of customer data ingestion.
|
| The Problem:
|
| Importing a CSV is a really hard problem to solve. Some of the
| key problems are:
|
| 1. Missing collaboration and automation in the CSV importing
| workflow:
|
| In a usual situation, the customer success team responsible for
| receiving CSV data has to engage in extensive back-and-forth
| communication with the customer to address unintentional manual
| errors present in a CSV. This process requires a high level of
| collaboration and may even necessitate assistance from the
| customer's internal teams to correct the data. The entire
| workflow is currently manual and therefore needs to be
| automated. Being able to quickly see data errors and fix them on
| the spot in a collaborative way with the customer is the way
| forward.
|
| 2. Scale: CRM CSV files can sometimes reach sizes as large as 4
| GB, making it nearly impossible to open them on a standalone
| machine for data correction. This presents a significant
| challenge for small businesses who cannot afford to invest in
| big data technologies such as EMR, Databricks, and ETL tools to
| address CSV import scaling problems.
|
| 3. Countless complex validation types: A single date format can
| have as many as 100 different variations, such as dd-mm-yyyy,
| mm-dd-yyyy, and dd.mm.yyyy. Manually setting validation rules
| for each of these formats is almost impossible, and correcting
| errors manually can also be difficult. Additionally, it can be
| challenging to comprehend errors without a human touch. Cross-
| validation between fields/columns is always a challenge in a
| specific CSV. For example, if a CSV contains two fields such as
| first name and age, creating custom validation to flag an error
| if the first name is missing and the age is greater than 50 can
| be really difficult (see the sketch after this list).
|
| 4. Data mapping issues: In a typical scenario, the recipient of
| CSV data provides a template to the data donor and creates a CSV
| column to template mapping before importing. However, in many
| cases, the CSV column names do not match the corresponding
| template column names. For instance, the data receiver may
| provide a field labeled "EMP date of Joining," but the uploaded
| CSV may contain a field labeled "EMP DOJ." These mapping issues
| can significantly slow down the CSV importing process.
|
| 5. Data security and privacy: It is always risky to share your
| customer data with third-party companies for data cleaning
| purposes.
|
| 6. Non-availability of low-code/no-code tools: Product managers
| and customer success teams, who are typically no-code users,
| often rely on data analysts to create a programmed CSV template
| with validation rules, which must be shared with customers to
| receive CSV data in a specific format. However, in an ideal
| scenario, no-code users should be able to create a template
| independently, without depending on developers.
|
| 7. Vague error messages: Unclear error messages do not provide
| users with enough context to confidently resolve their issues
| before uploading their data. Without a specific explanation of
| the problem, users may have to try various fixes until they find
| one that works. Example: while uploading a CSV file to a portal,
| I once received an error like "baseID is null" and was clueless :)
|
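| As an illustration of the cross-field case in point 3, a hand-
| rolled check could look roughly like this (a hypothetical
| sketch in plain JavaScript, not YoBulk's validation engine;
| the field names are just examples):
|
|   // Flag a row when first_name is missing and age > 50.
|   function crossFieldCheck(row) {
|     const errors = [];
|     const nameMissing = !row.first_name ||
|       row.first_name.trim() === '';
|     const age = Number(row.age);
|     if (nameMissing && age > 50) {
|       errors.push('first_name is required when age is over 50');
|     }
|     return errors;
|   }
|
|   crossFieldCheck({ first_name: '', age: '62' });
|   // -> [ 'first_name is required when age is over 50' ]
|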
| The Solution:
|
| 1. Smart Spreadsheet View: Designed to be a data exchange hub for
| any business that utilizes CSV files, YoBulk makes it easy to
| import and transform any CSV into a smart spreadsheet interface.
| This user-friendly interface highlights errors in a clear,
| concise manner, simplifying the task of cleaning data.
|
| 2. Bring your own validation function: YoBulk offers a platform
| for developers to create a custom CSV importer that includes
| personalized validation rules based on JSON Schema. With this
| functionality, developers can design an importer that meets their
| specific needs and preferences.
|
| 3. AI-first: YoBulk harnesses the power of OpenAI to provide
| advanced column matching, data cleaning and JSON schema
| generation features.
|
| 4. Built for scale: YoBulk is designed for large-scale CSV
| validation, with the ability to process files in the gigabyte
| range without any glitches or errors.
|
| 5. Embeddable: Take advantage of YoBulk's customizable import
| button feature, which can be embedded on any SaaS or App. This
| allows you to receive CSV data in the exact format you require,
| streamlining your workflows.
|
| Hosting and Deployment:
|
| YoBulk can be self-hosted and currently runs on MongoDB.
|
| Github : git clone git@github.com:yobulkdev/yobulkdev.git
|
| Getting started is really simple :
|
| Please refer https://doc.yobulk.dev/GetStarted/Installation
|
| Docker:
|
|   git clone https://github.com/yobulkdev/yobulkdev.git
|   cd yobulkdev
|   docker-compose up -d
|
| Or:
|
|   docker run --rm -it -p 5050:5050/tcp yobulk/yobulk
|
| Or run from source:
|
|   git clone https://github.com/yobulkdev/yobulkdev
|   cd yobulkdev
|   yarn install
|   yarn run dev
|
| Also please join our community at :
|
| - Github : https://github.com/yobulkdev/yobulkdev
| - Slack : https://join.slack.com/t/yobulkdev/signup
| - Twitter : https://twitter.com/YoBulkDev
| - Reddit : https://reddit.com/r/YoBulk
|
| Would love to hear your feedback & how we can make this better.
|
| Thank you,
|
| Team YoBulk
| Mystery-Machine wrote:
| Please let someone proofread the Readme; it's embarrassing.
| [deleted]
| yosai wrote:
| @Mystery-Machine happy to get your detailed feedback on the
| Readme. We will correct it.
| hattermat wrote:
| wow - this is huge, wonder how a lot of the companies in this
| space will respond
| yosai wrote:
| some of the companies in this space >> https://flatfile.com/,
| https://www.oneschema.co/, https://www....
| nerdponx wrote:
| This is a really interesting use of AI, and I think this has
| been a sought-after use case for a while. I recall the wave of
| "ML APIs" and auto-ML frameworks a few years ago that promised
| to use an ML model to automatically perform feature
| engineering, hyperparameter optimization, data cleaning, etc.,
| but never caught on as tools in the hands of non-experts.
|
| However I'm surprised that this works in a _completely automated_
| fashion. Given the fundamentally nondeterministic nature of
| language models, how do you ensure that the output is correct?
| Do you have a set of assertions that must become true about the
| data before the result is returned? How do you prevent the
| model from being too clever with your assertions, and replacing
| the data with all 0s or something similar, a la Asimov's
| Three Laws of Robotics (see e.g.
| https://en.wikipedia.org/wiki/Runaround_(story))?
| yosai wrote:
| @nerdponx This is really a great question. We are currently
| using AI for schema generation as well as column matching.
| The column matching is done with Dice's coefficient in the
| YoBulk system, but with the GPT column matcher we leverage
| the model to match the columns. Further, there is a roadmap
| for auto-cleaning by keeping historical records and building
| a model to sense the data type entered into the CSV for the
| specific organization. We give the user the final power to
| decide whether the GPT output is correct or not. Happy to
| engage with you on this topic.
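|
| For reference, a minimal bigram-based Dice coefficient looks
| roughly like this (an illustrative sketch, not the exact
| YoBulk implementation):
|
|   // Similarity in [0, 1]: 2 * shared bigrams / total bigrams.
|   function diceCoefficient(a, b) {
|     const bigrams = (s) => {
|       const out = [];
|       const t = s.toLowerCase();
|       for (let i = 0; i < t.length - 1; i++) {
|         out.push(t.slice(i, i + 2));
|       }
|       return out;
|     };
|     const aB = bigrams(a);
|     const bB = bigrams(b);
|     const total = aB.length + bB.length;
|     const pool = [...bB];
|     let shared = 0;
|     for (const g of aB) {
|       const idx = pool.indexOf(g);
|       if (idx !== -1) { shared++; pool.splice(idx, 1); }
|     }
|     return total === 0 ? 0 : (2 * shared) / total;
|   }
|
|   diceCoefficient('first name', 'first_name');       // ~0.78
|   diceCoefficient('EMP DOJ', 'EMP date of Joining'); // ~0.33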
| hermitcrab wrote:
| I take issue with "Non-availability of low code/No code tool".
| There are plenty of no-code and low-code ETL tools that are
| heavily used for reading, re-formatting and restructuring CSV
| files. For example, our own Easy Data Transform, which is a drag
| and drop data transformation tool aimed very much at business
| users, rather than professional data scientists.
| yosai wrote:
| @hermitcrab here we mean no-code/low-code for validation
| template creation. We are not an ETL tool, though yes, we do
| ETL operations internally. YoBulk is a flatfile.com alternative
| and primarily meant for the data donor. We provide a
| spreadsheet view for the data donor, who is mostly a
| non-technical person, to intuitively solve data errors. It's
| not meant for data scientists.
| dontcontactme wrote:
| "Open source tool powered by closed source API" Is it really open
| source then?
| data_ders wrote:
| wait... if importing malformed csvs gets automated that's like
| half of a data professional's job gone in a poof of smoke /s. jk
| -- great use case
|
| so often w/ pandas I'd:
|
| 1. "yeet" the csv into a dataframe
| 2. use dataframe methods to massage the data to a "clean" state
| 3. push as much of the df methods into pd.read_csv() parameter
|    options
|
| it'd be great to iterate more quickly on the above loop. Better
| yet -- what if it could auto-generate a letter to send to the
| folks from whom you got this data on how they could better
| output to csv to make ingestion simpler and easier for
| downstream users.... but maybe that letter would just be "don't
| use CSV!"
|
| related to flat data formats, it obviously makes sense to start
| with CSV, but what about the future? If this tool became
| ubiquitous, how might a SWE or data professional's job change?
| What opportunities would be created? As in:
|
| 1. CSV is ubiquitous but has no singularly well-adopted standard.
| 2. software and data engineers struggle with CSVs as a result of #1.
| 3. tool is created to reduce pain and friction.
| 4. profit? a new market? a new standard?
|
| Last, but most personally interesting, how much do you know
| about the Apache Arrow ecosystem and how its mission might
| overlap with YoBulk's?
| sgerenser wrote:
| _1. CSV is ubiquitous but has no singularly well-adopted
| standard. 2. software and data engineers struggle with CSVs as
| a result of #1. 3. tool is created to reduce pain and friction.
| 4. profit? a new market? a new standard?_
|
| The "revolutionary new tool" to replace CSVs was XML in the
| late 1990s.
| yosai wrote:
| @sgerenser Yes, CSV is everywhere. YoBulk is smartly
| positioning itself for the data donor/provider or customer. It
| is the end customer or data provider who bites the bullet and
| does the time-consuming data cleaning. The customer should know
| about the errors, duplicates, PII data, and inconsistency in
| the data, and has to be properly guided to clean the data in
| the best possible manner.
| nerdponx wrote:
| Except not at all. XML is harder than CSV to enter by hand
| without messing it up. CSV optimizes for the easiest cases and
| performs well on them. XML optimizes for the most complicated
| cases and therefore performs poorly on the easiest cases.
| JSON is somewhere in the middle. The main problems with CSV
| have to do with 1) MS Excel and 2) some kind of delusion
| among programmers that formatting or parsing arbitrary data
| is easy and you don't need a library for it, so you get hand-
| rolled generators and parsers that emit broken files.
|
| Otherwise, the problem with CSV has little to do with the CSV
| format as such and more to do with the fact that the data is
| stringly-typed. XML has the same problem. JSON interestingly
| does not. Everything has tradeoffs.
| aforwardslash wrote:
| nitpick, I wouldn't place JSON in the middle (the lack of
| proper integers and precision problems is one of the
| issues). but other than that, spot on.
| yosai wrote:
| @nerdponx You are spot on.
| refulgentis wrote:
| That's really interesting, I wonder if this simplifies down
| to "you want CSV with column typing and a typesafe CSV
| editor". As you note, JSON's win is the lack of issues with
| typing, and CSV really isn't complicated at all except for
| that property. JSON is just a row with keys that are
| columns.
| nerdponx wrote:
| I definitely want that! Parquet is great for data
| interchange, but it's not easily hand-editable. I wonder
| if there's an open niche in the software world for an
| Excel-like data entry and manipulation tool, but with
| stronger/stricter typing of cells and columns, and with
| direct export to and import from SQLite and Parquet.
| Fnoord wrote:
| To solve the XML issues you described we got schemas and
| syntax highlighting.
|
| I hate non-prettified JSON but it's easy to prettify in any
| editor, so it's a meh argument against JSON. But to solve
| the crap with the comma one needs a variant of JSON, and
| there are various of these...
|
| One other neat feature of CSV is it can be imported in a
| very popular and powerful IDE, called... Excel.
| refulgentis wrote:
| yeet is to dispose of with haste, not "move, but zoomer" or
| "sloppily with haste"
| chaps wrote:
| Wot, no. It's to throw hastily with no care of what it splats
| into. Pretty sure it came from a video of someone throwing
| something (food or drink?) in a crowded high school hallway,
| from within the hallway. It's chaotic, reckless energy in..
| mostly.. harmless form.
| wrycoder wrote:
| You're both saying the same thing from my pov. But thanks
| for the translations!
| chaps wrote:
| Now that I'm not on a phone, here's the video:
| https://www.youtube.com/watch?v=2Bjy5YQ5xPc
| recursive wrote:
| > CSV is ubiquitous but has no singularly well-adopted standard
|
| RFC4180 exists regardless of adoption level. In a way, the
| simplicity of the spec causes the proliferation of grammars. No
| one thinks they can just yolo a PDF by hand in a text editor.
| Ok, maybe PDF is a bad example. But CSV (as specified in
| RFC4180) is so dead simple that people take shortcuts.
| yosai wrote:
| Yes, you are absolutely right. We need a solution beyond the
| standard, as 80% of businesses run on CSV.
| ed_elliott_asc wrote:
| 99.99%?
| groestl wrote:
| And that's only the portion the businesses know about.
| thedudeabides5 wrote:
| working on it, this is a hard problem actually
| anothernewdude wrote:
| You'd have to be insane to trust GPT that much. I wouldn't want
| anything hallucinated in my data.
| boringg wrote:
| "half of a data professional's job" ... you mean like 90%
| yosai wrote:
| @data_ders We realized that more than 70% of business data is
| shared in CSV and Excel formats, and only a small percentage
| use API integrations for data exchange, so CSV is here to stay
| for sure. On the other side, the data engine is a sub-module
| inside YoBulk. We are trying to automate the complete CSV
| importing workflow, mostly solving the CSV errors in a
| collaborative way with the data donor. YoBulk's USP is how we
| show the errors in a human-readable way. We have written a
| wrapper on top of some open source data validation engines.
| Yes, I have used Apache Arrow; we are not competing with
| Apache Arrow. We are creating an alternative to flatfile.com.
| faebi wrote:
| I still like the jsonl standard quite a lot. JSON is pretty
| much universal, yet it's better structured than csv. The
| newline delimiter in jsonl makes each record easier to parse,
| independent of the remaining structure. Keys are duplicated a
| lot, but that's where gzip comes in.
| chaps wrote:
| Man, been down this path for a long while. It gets tough!
| Flattening csvs with hierarchical headers (as in, headers
| that apply a category to a second row of headers) is
| tough.
|
| The ways csv can fail are just fucking nuts. Especially when
| they're half hand-written, half automated, or where a failure
| is 20m rows in. Hard to have speed and strong checks
| simultaneously.
| yosai wrote:
| Yes, you are right. In YoBulk we flatten the CSV against a
| JSON schema, store it in a document DB, and do all the
| validations. Chunking the CSV and analysing the stream
| buffers for validation also gives us speed.
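|
| Very roughly, the chunk-validate-store loop can be sketched
| like this (an illustration assuming the csv-parse package and
| the official mongodb driver; the database, collection and
| batch size here are made-up examples, not YoBulk's code):
|
|   const fs = require('fs');
|   const { parse } = require('csv-parse');
|   const { MongoClient } = require('mongodb');
|
|   async function importCsv(path, validateRow) {
|     const client =
|       await MongoClient.connect('mongodb://localhost:27017');
|     const col = client.db('yobulk_demo').collection('imports');
|     const parser = fs.createReadStream(path)
|       .pipe(parse({ columns: true }));
|     let batch = [];
|     for await (const row of parser) {
|       const errors = validateRow(row); // e.g. schema checks
|       batch.push(errors.length ? { ...row, _errors: errors } : row);
|       if (batch.length >= 1000) {      // flush in chunks
|         await col.insertMany(batch);
|         batch = [];
|       }
|     }
|     if (batch.length) await col.insertMany(batch);
|     await client.close();
|   }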
| silent_cal wrote:
| pd.yeet("/data/data_1.csv")
|
| pd.yeet(lambda x: yeet(x))
|
| pd.yeet_to_csv("/clean_data/cleaned_data_1.csv")
| dstala wrote:
| > YoBulk harnesses the power of OpenAI to provide advanced column
| matching
|
| @yosai, can you give an example? just curious
| yosai wrote:
| @dstala thanks for exploring YoBulk. Under the hood, YoBulk
| uses OpenAI APIs which take the uploaded CSV column name and
| the template column as input and give an accurate match. You
| can try the product and let me know if you have any comments.
| anoonmoose wrote:
| I've been looking for an AI/GPT/deep learning tool that would
| help me perform some sanitation and normalization of a large data
| set that's quite personal to me - my last.fm data, time-stamped
| logs of (nearly) every song I've listened to for almost twenty
| years now. The data has all kinds of issues- for example,
| yesterday I realized that I had two sets of logs for one album.
| One version of the album used U+2026 (…) and one used three
| periods (...). There are problems like that, stuff more akin to
| typos, styling stuff (& vs and), or even garbage-in garbage-out
| stuff (YouTube Music changing the tags on the same album over
| time making it look like I actually listened to different albums,
| or not actually having all of the tags they're supposed to have).
|
| I've got .NET code that hits the last.fm api and dumps the info
| to a LiteDB database, so I can export to CSV pretty easily if
| this tool would be useful to me, unless anyone has any better
| directions to point me in. Appreciate any thoughts you folks
| have.
| a_subsystem wrote:
| We're using PowerBI for this kind of thing.
|
| It's certainly not open source, but you put in wonky tables,
| give it a couple/few examples of how you would like it to be,
| and it uses AI to spit out clean tables for export.
|
| I'm not a fan of proprietary working files, but if that ever
| becomes a problem, at least we've still got the data.
| nerdponx wrote:
| In the case of Unicode at least, the Unicode consortium
| maintains a database of "confusable" characters and a tool to
| detect them:
| https://util.unicode.org/UnicodeJsps/confusables.jsp?a=%E2%8...
|
| You can download the database for use in your own programs, and
| there is at least one Python package built around it:
| https://pypi.org/project/confusables/
| carterschonwald wrote:
| Is there a schema document explaining the format of the
| dataset?
|
| Edit: Found it in the associated doc. It's a cute approach!
|
| Data File Format
|
| Each line in the data file has the following format: Field 1
| is the source, Field 2 is the target, and Field 3 is
| obsolete, always containing the letters "MA" for backwards
| compatibility. For example:
|
| 0441 ; 0063 ; MA # ( с → c ) CYRILLIC SMALL LETTER ES → LATIN
| SMALL LETTER C #
|
| 2CA5 ; 0063 ; MA # ( ⲥ → c ) COPTIC SMALL LETTER SIMA → LATIN
| SMALL LETTER C # →c→
|
| Everything after the # is a comment and is purely
| informative. An asterisk after the comment indicates that the
| character is not an XID character [UAX31]. The comments
| provide the character names.
|
| Implementations that use the confusable data do not have to
| recursively apply the mappings, because the transforms are
| idempotent. That is,
|
| skeleton(skeleton(X)) = skeleton(X)
| IanCal wrote:
| If you've got borked encodings around as well, the python
| package ftfy is wonderful: https://pypi.org/project/ftfy/
|
| Undoes whatever on earth it is excel does, helps clean up
| bits of html/etc.
| 0live wrote:
| I would suggest https://github.com/OpenRefine/OpenRefine to
| clean your data.
| anoonmoose wrote:
| Love this suggestion, excited to check it out!
| yosai wrote:
| @0live Cleaning data is only one module. YoBulk helps you to
| automate the complete CSV import workflow. Please read our blog
| https://www.yobulk.dev/blog/Building%20an%20In-
| house%20CSV%2... to understand the CSV workflow problem. Happy
| to answer your queries.
| aarondia wrote:
| From the blog:
|
| > In a typical scenario, the customer success team who is
| in charge of this activity has to work back and forth with
| the customer. The customer has to resolve manual
| (unintended) errors.
|
| + 100. This is even the case when analysts are working with
| datasets that are created by their colleagues at the same
| company. Since most companies don't have clear standards
| for column header labelling, etc., getting a new dataset and
| incorporating it into an existing workflow requires
| collaboration with others from inside your company.
| yosai wrote:
| Thanks for resonating with the problem statement. Yes,
| internal teams also face the issue; you rely on the other
| team to clean the data. YoBulk is automating this workflow so
| that both data donor and receiver solve the data errors in a
| much more collaborative way.
| yosai wrote:
| @anoonmoose we have an internal pipeline which streams the
| MongoDB data to any CSV or any webhook URL path. It's an
| export pipeline which streams the processed data to CSVs. We
| will expose an API in the coming days which will fit your use
| case.
| breck wrote:
| It's still early, but TreeBase might be worth a look.
| (https://jtree.treenotation.org/treeBase/index.html)
|
| It's the public domain software that powers PLDB.com and
| CancerDB.com.
|
| You store your data in Tree Notation in plain text files, and
| use the Grammar Language (a Tree Language) for schemas, which
| also enforces correctness. You use Git for version control. You
| then can query the data using TQL (also a Tree Language). You
| can display your data using Scroll (also a Tree Language).
|
| So your data, your query language, your schemas, your display
| language are all in the same simple plain-text notation: Tree
| Notation. Of course, there's also a lot of Javascript glue.
|
| Very little documentation at the moment, and it's brand new,
| but it simply was not possible before the M1s, which came out
| in December 2020, and the growth rate is very good.
|
| It's all signal, no noise, so it's a timeless solution, and you
| won't regret putting your data in there.
| aarondia wrote:
| It's not an AI-based approach, but it is a step up from writing
| code by hand -- you could try using open source Mito ->
| https://www.trymito.io -> full disclosure I built it -> to do
| some of this messy data wrangling. Mito lets you view and
| manipulate your data in a spreadsheet in Jupyter and it
| generates the equivalent Python code for each edit. For things
| like identifying that the data uses '&' and 'and', viewing your
| data in a spreadsheet is >> just writing code.
|
| Once you generate the code, you could copy it into your
| pipeline so that you pull the data from the last.fm API,
| preprocess it with the Python code that Mito generated, and
| then dump it into the LiteDB.
| [deleted]
| WhiteNoiz3 wrote:
| When I read the headline I thought this would take a few rows of
| your CSV file and generate the schema from that using AI. Seems
| like you still need to manually describe the columns.
| yosai wrote:
| Yes, we have a workflow for your use case. YoBulk can create a
| template or schema by uploading a CSV; we read some lines and
| create the schema. Right now we have not added AI for that.
| This flow is very handy for use cases like when you want to
| upload a CSV file to HubSpot or LinkedIn and want to do the
| data cleaning according to the HubSpot or LinkedIn defined
| template. People can upload a LinkedIn or HubSpot template CSV
| to YoBulk and create a template, then validate their CSV data
| against the YoBulk template before uploading to the HubSpot or
| LinkedIn portal.
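|
| A very rough sketch of that kind of sample-based schema
| inference (hypothetical and heavily simplified, not YoBulk's
| actual engine):
|
|   // Guess a column type from a handful of sample values.
|   function inferType(samples) {
|     const vals = samples.filter((v) => v !== '' && v != null);
|     if (vals.length === 0) return 'string';
|     if (vals.every((v) => /^-?\d+$/.test(v))) return 'integer';
|     if (vals.every((v) => !isNaN(Number(v)))) return 'number';
|     if (vals.every((v) => !isNaN(Date.parse(v)))) return 'date';
|     return 'string';
|   }
|
|   // Build a minimal schema from the first few parsed rows.
|   function inferSchema(rows) {
|     const properties = {};
|     for (const col of Object.keys(rows[0] || {})) {
|       properties[col] = {
|         type: inferType(rows.map((r) => r[col])),
|       };
|     }
|     return { type: 'object', properties };
|   }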
| iLoveOncall wrote:
| This is a trivial problem you can solve in a hundred lines in
| any programming language, you don't need AI.
| blowski wrote:
| It's also something you can do manually with a few admin
| people, or perhaps using a COBOL script. Innovation means
| we'll have different ways of doing the same thing.
| jimlongton wrote:
| Does this send _all_ the data to a third party? What if it
| contains personal information?
| yosai wrote:
| No data is sent to any 3rd party. YoBulk is self-hosted; your
| personal information is stored in your own database. Feel free
| to ask any data security related questions.
| counttheforks wrote:
| How are you running GPT locally? OpenAI is a third party.
| wstuartcl wrote:
| From a cursory overview, it looks like the only thing OpenAI
| does is generate a schema from a specific hand-written
| prompt.
| yosai wrote:
| There are multiple use cases where we are using OpenAI.
| Example: matching an uploaded CSV's columns to template
| columns via string matching.
| jimlongton wrote:
| That makes sense. It would be good to add this
| information to the README.
| yosai wrote:
| Sure, we will add it to the README. We captured it in our
| documentation; please have a look:
| https://doc.yobulk.dev/YoBulk%20AI/AI%20usecases
| yosai wrote:
| We are using OpenAI only to create schemas, do column
| matching, and generate regexes. No CSV data is sent to
| OpenAI.
| nerdponx wrote:
| So you ask OpenAI to generate a schema and computer code
| that will clean input data to conform to that schema, and
| then run the user's data through that program? Is it
| possible for users to obtain and audit the generated code
| for correctness, performance, etc.? How do you prevent
| things like the AI from generating catastrophically-
| backtracking regex, O(N!) algorithms, or outright
| mistakes?
| yosai wrote:
| @nerdponx OpenAI only generates a validation schema. We have
| our own schema generation engine for any custom validation,
| which can be used where OpenAI is not able to understand or
| generate a correct schema. Yes, you are absolutely right:
| OpenAI's output is not always right. We have integrated a
| JSON parser which validates OpenAI's output, and we are
| currently developing a regex parser which validates OpenAI's
| output. Hope it answers your query. Happy to understand your
| pain point more.
| nerdponx wrote:
| Interesting, thanks for explaining how it works. Would it
| also be possible to construct one of these JSON schemas
| by hand, without OpenAI? The core data cleaning system
| sounds like just as interesting a piece of technology as
| the AI schema generation.
| yosai wrote:
| @nerdponx Yes, YoBulk provides a way to write JSON schemas
| by hand. We have added some custom keywords like "validate"
| which are not defined in the JSON Schema standard, but our
| validation engine can understand them. You can pass a
| JavaScript function through the validate keyword. It's a
| game changer.
|
|   "first_name": {
|     "type": "string",
|     "format": "first_name",
|     "validate": "(x) => {\r\n let regex = new RegExp(\"([a-z][a-zA-Z]*)\");\r\n return regex.test(x);\r\n }"
|   },
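|
| One hypothetical way an engine could evaluate such a
| stringified validator (illustrative only; the actual YoBulk
| mechanism may differ):
|
|   // Turn the stored string into a callable and run it on a cell.
|   const validateSrc =
|     '(x) => { let regex = new RegExp("([a-z][a-zA-Z]*)"); ' +
|     'return regex.test(x); }';
|   const toFn = (src) => new Function(`return (${src})`)();
|   const isValidFirstName = toFn(validateSrc);
|   isValidFirstName('alice'); // true
|   isValidFirstName('123');   // false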
| nerdponx wrote:
| This is a very interesting combination of technologies.
| Thank you again for explaining how they work! I tend to
| prefer open-source solutions for my own work, but I can
| see this being highly useful and valuable for many
| businesses.
| justeleblanc wrote:
| So what exactly is sent to OpenAI?
| yosai wrote:
| @justeleblanc we do auto column matching through OpenAI.
| Example: if you have defined a template with the column name
| "Date of Joining" and the CSV data donor uploads a CSV with a
| field "DOJ", then your validation engine would skip the import
| as it does not match the expected column names. Here OpenAI
| comes in handy: it gets the context and smartly identifies
| that DOJ is the same as Date of Joining and does the
| importing. You can go through
| https://doc.yobulk.dev/YoBulk%20AI/AI%20usecases to
| understand more on the AI use cases of YoBulk.
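|
| A rough sketch of what that kind of call can look like with
| the openai Node package (the prompt wording and model choice
| here are assumptions for illustration, not YoBulk's actual
| code):
|
|   const { Configuration, OpenAIApi } = require('openai');
|   const openai = new OpenAIApi(
|     new Configuration({ apiKey: process.env.OPENAI_API_KEY })
|   );
|
|   async function matchColumn(csvColumn, templateColumns) {
|     // Only column names are sent, never the CSV rows.
|     const prompt =
|       `Which of these template columns best matches ` +
|       `"${csvColumn}"?\n` +
|       `Template columns: ${templateColumns.join(', ')}\n` +
|       `Answer with the template column name only.`;
|     const res = await openai.createCompletion({
|       model: 'text-davinci-003',
|       prompt,
|       max_tokens: 20,
|       temperature: 0,
|     });
|     return res.data.choices[0].text.trim();
|   }
|
|   // matchColumn('EMP DOJ', ['EMP date of Joining', 'Salary'])
|   //   -> 'EMP date of Joining'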
| dang wrote:
| Sorry for the offtopicness but can you please email me at
| hn@ycombinator.com?
| cdolan wrote:
| Cheers to OP for whatever magic this interaction has
| unlocked!
| dang wrote:
| I resort to comments like that when I don't have another
| way of contacting the user. If it happens in a Show HN
| thread, it's probably that I want to ask if they've
| considered applying to YC with the project. People often
| underestimate what YC might be interested in. Open-source
| startups are a big part of what gets funded these days. I
| particularly love it when startups make it in to YC
| through HN--it's great for both YC and HN, and if a big
| success ever came out of it, it would easily fund HN for
| another century :)
|
| If I post a please-email-me outside of a Show HN thread,
| it might be anything! but the most common reason is that
| I want to send them a repost invite for some cool article
| they posted long ago. Invited reposts make it into the
| second-chance pool (see https://news.ycombinator.com/pool
| and https://news.ycombinator.com/item?id=26998308).
| l33t233372 wrote:
| I understand that this product does use OpenAI's API, but I
| just want to stress that OpenAI doesn't own GPT, it did not
| create transformer models, and GPT is not something usable
| exclusively through OpenAI.
|
| A priori, there's no reason they couldn't include their own
| GPT model that also lives on your own server.
| olalonde wrote:
| According to Wikipedia, OpenAI did create GPT[0] and the
| GPT-3 model[1], which this tool depends on, is
| exclusively available through the OpenAI API. It's true
| that some open alternatives to GPT-3 have been popping up
| but I believe they are still behind in terms of quality.
|
| [0] https://en.wikipedia.org/wiki/Generative_pre-
| trained_transfo...
|
| [1] https://en.wikipedia.org/wiki/GPT-3
| l33t233372 wrote:
| Thank you for adding a correction. I shouldn't have been
| so flippant: OpenAI did not invent transformers, GPT-3 is
| an iteration of the BERT architecture, which is an
| encoder, whereas GPT is a decoder model.
|
| Regardless, a GPT model is not a proprietary technology
| you must get from OpenAI. You're right that this tool
| uses OpenAI's API, but this isn't necessarily implied by
| the use of GPT.
| coolspot wrote:
| > Thank you for adding a correction. I shouldn't have
| been so flippant: OpenAI did not invent transformers
|
| BTW your answer structure is very similar to ChatGPT's
| when you point it to a mistake.
| l33t233372 wrote:
| Well to be fair GPT-3 was trained on internet comments!
| elevenoh wrote:
| [dead]
| yosai wrote:
| @l33t233372 you are absolutely spot on.
| [deleted]
| [deleted]
| yarapavan wrote:
| This looks good and promising! Congrats and best wishes, yosai!!
| yosai wrote:
| @yarapavan Thanks. Please explore the product!
| PaulHoule wrote:
| Around the time that BERT and fasttext were just coming out, I
| worked at a startup that had built a system that used text CNNs
| to interpret CSV files, particularly we had models that profiled
| at the level of individual cells by classifying either the
| content of the cell alone or the content of the cell plus the
| label of the column.
|
| I was thinking we were going to get bought by one of our big
| clients like Airbus or a major accounting firm but actually the
| firm got bought by a major shoe and clothing brand. I still wear
| swag from that employer to the gym sometimes so I like to think
| it got transmigrated when the acquisition happened.
| Der_Einzige wrote:
| My guess is Zalando, right? I was always intrigued at the
| quality of NLP research coming out of them!
| PaulHoule wrote:
| No, it was this
|
| https://fdra.org/latest-news/nike-acquires-data-
| integration-...
___________________________________________________________________
(page generated 2023-02-21 23:00 UTC)