[HN Gopher] Show HN: Yobulk - Open-source CSV importer powered b...
___________________________________________________________________
Show HN: Yobulk - Open-source CSV importer powered by GPT3
Author : yosai
Score : 177 points
Date : 2023-02-21 14:05 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| cjtechie wrote:
| This is exactly what I was looking for. Will it help me run
| big files and cleanse them?
| yosai wrote:
| YoBulk uses buffer streaming internally, so you can upload a CSV
| that is gigabytes in size. You can try it at your end and let me
| know your feedback.
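|
| A rough sketch of the general streaming approach (illustrative
| only, assuming the csv-parse package and an example file name;
| not necessarily YoBulk's exact code):
|
|   const fs = require('fs');
|   const { parse } = require('csv-parse');
|
|   // Stream the file instead of loading it all into memory,
|   // so multi-GB CSVs can be processed record by record.
|   const parser = fs.createReadStream('billboards.csv')
|     .pipe(parse({ columns: true }));
|
|   parser.on('data', (row) => {
|     // validate / transform each record as it arrives
|   });
|   parser.on('end', () => console.log('done'));
|   parser.on('error', (err) => console.error(err));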
| cjtechie wrote:
| Thank you. Let me try this out
| bcrl wrote:
| This strikes me as an idea as bad as the Xerox document scanner
| that implemented a compression algorithm that changed digits.
| It'll be really fun debugging when something completely
| unexpected gets spit out of the neural network.
| yosai wrote:
| @bcrl we provide our own schema which is understood by our
| validation engine. It's a user option to use GPT or not. GPT
| output always comes with a disclaimer that it might not be
| correct. We will be solving that gradually.
| yosai wrote:
| Hey Everybody,
|
| We are really excited to open source YoBulk today.
|
| YoBulk is an open source CSV importer for any SaaS application -
| It's a free alternative to https://flatfile.com/
|
| Why are we building YoBulk:
|
| In our previous startup, we were receiving CSV files from various
| billboard screen owners every day, following a specific template
| that we defined. Despite the well-defined template, the CSV files
| we received often contained manual errors, which was a challenge
| to fix with the data provider.
|
| We were receiving around 500,000 billboard data updates each day,
| including price changes and creative info data. It was a
| difficult and time-consuming job to clean and format the data to
| fit our database schema and upload it into the database. As a
| result, we wanted to automate the entire CSV importing process.
| In our second startup, we encountered similar challenges when
| cleaning large CSV files with location and timestamp data.
|
| We realised that more than 70% of business data is shared in CSV
| and Excel formats, and only a small percentage use API
| integrations for data exchange. As developers and product
| managers, we have experienced the difficulties of building a
| scalable CSV importer, and we know that many others face the same
| challenges. Our goal is to solve this problem by taking an open
| source, AI-first and developer-centric approach.
|
| Who can use YoBulk:
|
| YoBulk is a highly beneficial tool for a variety of
| professionals, such as Developers, Product Managers, Customer
| Success teams, and Marketers. It simplifies the process of
| onboarding and verifying customer data, making it an
| indispensable asset for those who deal with frequent CSV data
| uploads to a system with a predetermined schema or template.
|
| This tool is particularly valuable for updating sales CRM or
| product catalog data, and it effectively solves the initial
| challenge of customer data ingestion.
|
| The Problem:
|
| Importing a CSV is a really hard problem to solve. Some of the
| key problems are:
|
| 1. Missing collaboration and automation in the CSV importing
| workflow:
|
| In a usual situation, the customer success team responsible for
| receiving CSV data has to engage in extensive back-and-forth
| communication with the customer to address unintentional manual
| errors present in a CSV. This process requires a high level of
| collaboration and may even necessitate assistance from the
| customer's internal teams to correct the data. The entire
| workflow is currently manual and therefore needs to be
| automated. Being able to quickly see data errors and fix them on
| the spot in a collaborative way with the customer is the way
| forward.
|
| 2. Scale: CRM CSV files can sometimes reach sizes as large as 4
| GB, making it nearly impossible to open them on a standalone
| machine for data correction. This presents a significant
| challenge for small businesses who cannot afford to invest in
| big data technologies such as EMR, Databricks, and ETL tools to
| address CSV import scaling problems.
|
| 3. Countless complex validation types: A single date format can
| have as many as 100 different variations, such as dd-mm-yyyy,
| mm-dd-yyyy, and dd.mm.yyyy. Manually setting validation rules
| for each of these formats is almost impossible, and correcting
| errors manually can also be difficult. Additionally, it can be
| challenging to comprehend errors without a human touch. Cross-
| validation between fields/columns is always a challenge in a
| specific CSV. For example, if a CSV contains two fields such as
| first name and age, creating custom validation to flag an error
| if the first name is missing and the age is greater than 50 can
| be really difficult (see the sketch after this list).
|
| 4. Data mapping issues: In a typical scenario, the recipient of
| CSV data provides a template to the data donor and creates a CSV
| column to template mapping before importing. However, in many
| cases, the CSV column names do not match the corresponding
| template column names. For instance, the data receiver may
| provide a field labeled "EMP date of Joining," but the uploaded
| CSV may contain a field labeled "EMP DOJ." These mapping issues
| can significantly slow down the CSV importing process.
|
| 5. Data security and privacy: It is always risky to share your
| customer data with third-party companies for data cleaning
| purposes.
|
| 6. Non-availability of low-code/no-code tools: Product managers
| and customer success teams, who are typically no-code users,
| often rely on data analysts to create a programmed CSV template
| with validation rules, which must be shared with customers to
| receive CSV data in a specific format. However, in an ideal
| scenario, no-code users should be able to create a template
| independently, without depending on developers.
|
| 7. Vague error messages: Unclear error messages do not provide
| users with enough context to confidently resolve their issues
| before uploading their data. Without a specific explanation of
| the problem, users may have to try various fixes until they find
| one that works. Example: while uploading a CSV file to a portal,
| I once received an error like "baseID is null" and was clueless :)
|
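| As an illustration of the cross-field case in point 3, a hand-
| rolled check could look roughly like this (a hypothetical
| sketch in plain JavaScript, not YoBulk's validation engine;
| the field names are just examples):
|
|   // Flag a row when first_name is missing and age > 50.
|   function crossFieldCheck(row) {
|     const errors = [];
|     const nameMissing = !row.first_name ||
|       row.first_name.trim() === '';
|     const age = Number(row.age);
|     if (nameMissing && age > 50) {
|       errors.push('first_name is required when age is over 50');
|     }
|     return errors;
|   }
|
|   crossFieldCheck({ first_name: '', age: '62' });
|   // -> [ 'first_name is required when age is over 50' ]
|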
| The Solution:
|
| 1. Smart Spreadsheet View: Designed to be a data exchange hub for
| any business that utilizes CSV files, YoBulk makes it easy to
| import and transform any CSV into a smart spreadsheet interface.
| This user-friendly interface highlights errors in a clear,
| concise manner, simplifying the task of cleaning data.
|
| 2. Bring your own validation function: YoBulk offers a platform
| for developers to create a custom CSV importer that includes
| personalized validation rules based on JSON Schema. With this
| functionality, developers can design an importer that meets their
| specific needs and preferences.
|
| 3. AI-first: YoBulk harnesses the power of OpenAI to provide
| advanced column matching, data cleaning and JSON schema
| generation features.
|
| 4. Built for scale: YoBulk is designed for large-scale CSV
| validation, with the ability to process files in the gigabyte
| range without any glitches or errors.
|
| 5. Embeddable: Take advantage of YoBulk's customizable import
| button feature, which can be embedded on any SaaS or App. This
| allows you to receive CSV data in the exact format you require,
| streamlining your workflows.
|
| Hosting and Deployment:
|
| YoBulk can be self-hosted and currently runs on MongoDB.
|
| Github : git clone git@github.com:yobulkdev/yobulkdev.git
|
| Getting started is really simple :
|
| Please refer https://doc.yobulk.dev/GetStarted/Installation
|
| Docker:
|
|   git clone https://github.com/yobulkdev/yobulkdev.git
|   cd yobulkdev
|   docker-compose up -d
|
| Or:
|
|   docker run --rm -it -p 5050:5050/tcp yobulk/yobulk
|
| Or run from source:
|
|   git clone https://github.com/yobulkdev/yobulkdev
|   cd yobulkdev
|   yarn install
|   yarn run dev
|
| Also please join our community at :
|
| - Github : https://github.com/yobulkdev/yobulkdev
| - Slack : https://join.slack.com/t/yobulkdev/signup
| - Twitter : https://twitter.com/YoBulkDev
| - Reddit : https://reddit.com/r/YoBulk
|
| Would love to hear your feedback & how we can make this better.
|
| Thank you,
|
| Team YoBulk
| Mystery-Machine wrote:
| Please let someone proofread the Readme; it's embarrassing.
| [deleted]
| yosai wrote:
| @Mystery-Machine happy to get your detailed feedback on the
| Readme. We will correct it.
| hattermat wrote:
| wow - this is huge, wonder how a lot of the companies in this
| space will respond
| yosai wrote:
| some of the companies in this space >> https://flatfile.com/,
| https://www.oneschema.co/, https://www....
| nerdponx wrote:
| This is a really interesting use of AI, and I think this has
| been a sought-after use case for a while. I recall the wave of
| "ML APIs" and auto-ML frameworks a few years ago that promised
| to use an ML model to automatically perform feature
| engineering, hyperparameter optimization, data cleaning, etc.,
| but never caught on as tools in the hands of non-experts.
|
| However I'm surprised that this works in a _completely automated_
| fashion. Given the fundamentally nondeterministic nature of
| language models, how do you ensure that the output is correct?
| Do you have a set of assertions that must become true about the
| data before the result is returned? How do you prevent the
| model from being too clever with your assertions, and replacing
| the data with all 0s or something similar, a la Asimov's
| Three Laws of Robotics (see e.g.
| https://en.wikipedia.org/wiki/Runaround_(story))?
| yosai wrote:
| @nerdponx This is really a great question. We are currently
| using AI for schema generation as well as column matching.
| The column matching is done with Dice's coefficient in the
| YoBulk system, but with the GPT column matcher we leverage
| the model to match the columns. Further, there is a roadmap
| for auto-cleaning by keeping historical records and building
| a model to sense the data type entered into the CSV for the
| specific organization. We give the user the final power to
| decide whether the GPT output is correct or not. Happy to
| engage with you on this topic.
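|
| For reference, a minimal bigram-based Dice coefficient looks
| roughly like this (an illustrative sketch, not the exact
| YoBulk implementation):
|
|   // Similarity in [0, 1]: 2 * shared bigrams / total bigrams.
|   function diceCoefficient(a, b) {
|     const bigrams = (s) => {
|       const out = [];
|       const t = s.toLowerCase();
|       for (let i = 0; i < t.length - 1; i++) {
|         out.push(t.slice(i, i + 2));
|       }
|       return out;
|     };
|     const aB = bigrams(a);
|     const bB = bigrams(b);
|     const total = aB.length + bB.length;
|     const pool = [...bB];
|     let shared = 0;
|     for (const g of aB) {
|       const idx = pool.indexOf(g);
|       if (idx !== -1) { shared++; pool.splice(idx, 1); }
|     }
|     return total === 0 ? 0 : (2 * shared) / total;
|   }
|
|   diceCoefficient('first name', 'first_name');       // ~0.78
|   diceCoefficient('EMP DOJ', 'EMP date of Joining'); // ~0.33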
| hermitcrab wrote:
| I take issue with "Non-availability of low code/No code tool".
| There are plenty of no-code and low-code ETL tools that are
| heavily used for reading, re-formatting and restructuring CSV
| files. For example, our own Easy Data Transform, which is a drag
| and drop data transformation tool aimed very much at business
| users, rather than professional data scientists.
| yosai wrote:
| @hermitcrab here we mean no-code/low-code for validation
| template creation. We are not an ETL tool, though yes, we do
| ETL operations internally. YoBulk is a flatfile.com alternative
| and primarily meant for the data donor. We provide a
| spreadsheet view for the data donor, who is mostly a
| non-technical person, to intuitively solve data errors. It's
| not meant for data scientists.
| dontcontactme wrote:
| "Open source tool powered by closed source API" Is it really open
| source then?
| data_ders wrote:
| wait... if importing malformed csvs gets automated that's like
| half of a data professional's job gone in a poof of smoke /s. jk
| -- great use case
|
| so often w/ pandas I'd:
|
| 1. "yeet" the csv into a dataframe
| 2. use dataframe methods to massage the data to a "clean" state
| 3. push as much of the df methods into pd.read_csv() parameter
|    options
|
| it'd be great to iterate more quickly on the above loop. Better
| yet -- what if it could auto-generate a letter to send to the
| folks from whom you got this data on how they could better
| output to csv to make ingestion simpler and easier for
| downstream users.... but maybe that letter would just be "don't
| use CSV!"
|
| related to flat data formats, it obviously makes sense to start
| with CSV, but what about the future? If this tool became
| ubiquitous, how might a SWE or data professional's job change?
| What opportunities would be created? As in:
|
| 1. CSV is ubiquitous but has no singularly well-adopted standard.
| 2. software and data engineers struggle with CSVs as a result of #1.
| 3. tool is created to reduce pain and friction.
| 4. profit? a new market? a new standard?
|
| Last, but most personally interesting, how much do you know
| about the Apache Arrow ecosystem and how its mission might
| overlap with YoBulk's?
| sgerenser wrote:
| _1. CSV is ubiquitous but has no singularly well-adopted
| standard. 2. software and data engineers struggle with CSVs as
| a result of #1. 3. tool is created to reduce pain and friction.
| 4. profit? a new market? a new standard?_
|
| The "revolutionary new tool" to replace CSVs was XML in the
| late 1990s.
| yosai wrote:
| @sgerenser Yes, CSV is everywhere. YoBulk is smartly
| positioning itself for the data donor/provider or customer. It
| is the end customer or data provider who bites the bullet and
| does the time-consuming data cleaning. The customer should know
| about the errors, duplicates, PII data, and inconsistency in
| the data, and has to be properly guided to clean the data in
| the best possible manner.
| nerdponx wrote:
| Except not at all. XML is harder than CSV to enter by hand
| without messing it up. CSV optimizes for the easiest cases and
| performs well on them. XML optimizes for the most complicated
| cases and therefore performs poorly on the easiest cases.
| JSON is somewhere in the middle. The main problems with CSV
| have to do with 1) MS Excel and 2) some kind of delusion
| among programmers that formatting or parsing arbitrary data
| is easy and you don't need a library for it, so you get hand-
| rolled generators and parsers that emit broken files.
|
| Otherwise, the problem with CSV has little to do with the CSV
| format as such and more to do with the fact that the data is
| stringly-typed. XML has the same problem. JSON interestingly
| does not. Everything has tradeoffs.
| aforwardslash wrote:
| nitpick, I wouldn't place JSON in the middle (the lack of
| proper integers and precision problems is one of the
| issues). but other than that, spot on.
| yosai wrote:
| @nerdponx You are spot on.
| refulgentis wrote:
| That's really interesting, I wonder if this simplifies down
| to "you want CSV with column typing and a typesafe CSV
| editor". As you note, JSON's win is the lack of issues with
| typing, and CSV really isn't complicated at all except for
| that property. JSON is just a row with keys that are
| columns.
| nerdponx wrote:
| I definitely want that! Parquet is great for data
| interchange, but it's not easily hand-editable. I wonder
| if there's an open niche in the software world for an
| Excel-like data entry and manipulation tool, but with
| stronger/stricter typing of cells and columns, and with
| direct export to and import from SQLite and Parquet.
| Fnoord wrote:
| To solve the XML issues you described we got schemas and
| syntax highlighting.
|
| I hate non-prettified JSON but it's easy to prettify in any
| editor, so it's a meh argument against JSON. But to solve
| the crap with the comma one needs a variant of JSON, and
| there are various of these...
|
| One other neat feature of CSV is it can be imported in a
| very popular and powerful IDE, called... Excel.
| refulgentis wrote:
| yeet is to dispose of with haste, not "move, but zoomer" or
| "sloppily with haste"
| chaps wrote:
| Wot, no. It's to throw hastily with no care of what it splats
| into. Pretty sure it came from a video of someone throwing
| something (food or drink?) in a crowded high school hallway,
| from within the hallway. It's chaotic, reckless energy in..
| mostly.. harmless form.
| wrycoder wrote:
| You're both saying the same thing from my pov. But thanks
| for the translations!
| chaps wrote:
| Now that I'm not on a phone, here's the video:
| https://www.youtube.com/watch?v=2Bjy5YQ5xPc
| recursive wrote:
| > CSV is ubiquitous but has no singularly well-adopted standard
|
| RFC4180 exists regardless of adoption level. In a way, the
| simplicity of the spec causes the proliferation of grammars. No
| one thinks they can just yolo a PDF by hand in a text editor.
| Ok, maybe PDF is a bad example. But CSV (as specified in
| RFC4180) is so dead simple that people take shortcuts.
| yosai wrote:
| Yes, you are absolutely right. We need a solution beyond the
| standard, as 80% of businesses run on CSV.
| ed_elliott_asc wrote:
| 99.99%?
| groestl wrote:
| And that's only the portion the businesses know about.
| thedudeabides5 wrote:
| working on it, this is a hard problem actually
| anothernewdude wrote:
| You'd have to be insane to trust GPT that much. I wouldn't want
| anything hallucinated in my data.
| boringg wrote:
| "half of a data professional's job" ... you mean like 90%
| yosai wrote:
| @data_ders We realized that more than 70% of business data is
| shared in CSV and Excel formats, and only a small percentage
| use API integrations for data exchange, so CSV is here to stay
| for sure. On the other side, the data engine is a sub-module
| inside YoBulk. We are trying to automate the complete CSV
| importing workflow, mostly solving the CSV errors in a
| collaborative way with the data donor. YoBulk's USP is how we
| show the errors in a human-readable way. We have written a
| wrapper on top of some open source data validation engines.
| Yes, I have used Apache Arrow; we are not competing with
| Apache Arrow. We are creating an alternative to flatfile.com.
| faebi wrote:
| I still like the jsonl standard quite a lot. JSON is pretty
| much universal, yet it's better structured than csv. The
| newline delimiter in jsonl makes each record easier to parse,
| independent of the remaining structure. Keys are duplicated a
| lot, but that's where gzip comes in.
| chaps wrote:
| Man, been down this path for a long while. It gets tough!
| Flattening csvs with hierarchical headers (as in, headers
| that apply a category to a second row of headers) is
| tough.
|
| The ways csv can fail are just fucking nuts. Especially when
| they're half hand-written, half automated, or where a failure
| is 20m rows in. Hard to have speed and strong checks
| simultaneously.
| yosai wrote:
| Yes, you are right. In YoBulk we flatten the CSV against a
| JSON schema, store it in a document DB, and do all the
| validations. Chunking the CSV and analysing the stream
| buffers for validation also gives us speed.
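|
| Very roughly, the chunk-validate-store loop can be sketched
| like this (an illustration assuming the csv-parse package and
| the official mongodb driver; the database, collection and
| batch size here are made-up examples, not YoBulk's code):
|
|   const fs = require('fs');
|   const { parse } = require('csv-parse');
|   const { MongoClient } = require('mongodb');
|
|   async function importCsv(path, validateRow) {
|     const client =
|       await MongoClient.connect('mongodb://localhost:27017');
|     const col = client.db('yobulk_demo').collection('imports');
|     const parser = fs.createReadStream(path)
|       .pipe(parse({ columns: true }));
|     let batch = [];
|     for await (const row of parser) {
|       const errors = validateRow(row); // e.g. schema checks
|       batch.push(errors.length ? { ...row, _errors: errors } : row);
|       if (batch.length >= 1000) {      // flush in chunks
|         await col.insertMany(batch);
|         batch = [];
|       }
|     }
|     if (batch.length) await col.insertMany(batch);
|     await client.close();
|   }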
| silent_cal wrote:
| pd.yeet("/data/data_1.csv")
|
| pd.yeet(lambda x: yeet(x))
|
| pd.yeet_to_csv("/clean_data/cleaned_data_1.csv")
| dstala wrote:
| > YoBulk harnesses the power of OpenAI to provide advanced column
| matching
|
| @yosai, can you give an example? just curious
| yosai wrote:
| @dstala thanks for exploring YoBulk. Under the hood, YoBulk
| uses OpenAI APIs which take the uploaded CSV column name and
| the template column as input and give an accurate match. You
| can try the product and let me know if you have any comments.
| anoonmoose wrote:
| I've been looking for an AI/GPT/deep learning tool that would
| help me perform some sanitation and normalization of a large data
| set that's quite personal to me - my last.fm data, time-stamped
| logs of (nearly) every song I've listened to for almost twenty
| years now. The data has all kinds of issues- for example,
| yesterday I realized that I had two sets of logs for one album.
| One version of the album used U+2026 (…) and one used three
| periods (...). There are problems like that, stuff more akin to
| typos, styling stuff (& vs and), or even garbage-in garbage-out
| stuff (YouTube Music changing the tags on the same album over
| time making it look like I actually listened to different albums,
| or not actually having all of the tags they're supposed to have).
|
| I've got .NET code that hits the last.fm api and dumps the info
| to a LiteDB database, so I can export to CSV pretty easily if
| this tool would be useful to me, unless anyone has any better
| directions to point me in. Appreciate any thoughts you folks
| have.
| a_subsystem wrote:
| We're using PowerBI for this kind of thing.
|
| It's certainly not open source, but you put in wonky tables,
| give it a couple/few examples of how you would like it to be,
| and it uses AI to spit out clean tables for export.
|
| I'm not a fan of proprietary working files, but if that ever
| becomes a problem, at least we've still got the data.
| nerdponx wrote:
| In the case of Unicode at least, the Unicode consortium
| maintains a database of "confusable" characters and a tool to
| detect them:
| https://util.unicode.org/UnicodeJsps/confusables.jsp?a=%E2%8...
|
| You can download the database for use in your own programs, and
| there is at least one Python package built around it:
| https://pypi.org/project/confusables/
| carterschonwald wrote:
| Is there a schema document explaining the format of the
| dataset?
|
| Edit: Found it in the associated doc. It's a cute approach!
|
| Data File Format
|
| Each line in the data file has the following format: Field 1
| is the source, Field 2 is the target, and Field 3 is
| obsolete, always containing the letters "MA" for backwards
| compatibility. For example:
|
| 0441 ; 0063 ; MA # ( с → c ) CYRILLIC SMALL LETTER ES → LATIN
| SMALL LETTER C #
|
| 2CA5 ; 0063 ; MA # ( ⲥ → c ) COPTIC SMALL LETTER SIMA → LATIN
| SMALL LETTER C # →c→
|
| Everything after the # is a comment and is purely
| informative. An asterisk after the comment indicates that the
| character is not an XID character [UAX31]. The comments
| provide the character names.
|
| Implementations that use the confusable data do not have to
| recursively apply the mappings, because the transforms are
| idempotent. That is,
|
| skeleton(skeleton(X)) = skeleton(X)
| IanCal wrote:
| If you've got borked encodings around as well, the python
| package ftfy is wonderful: https://pypi.org/project/ftfy/
|
| Undoes whatever on earth it is excel does, helps clean up
| bits of html/etc.
| 0live wrote:
| I would suggest https://github.com/OpenRefine/OpenRefine to
| clean your data.
| anoonmoose wrote:
| Love this suggestion, excited to check it out!
| yosai wrote:
| @0live Cleaning data is only one module. YoBulk helps you to
| automate the complete CSV import workflow. Please read our blog
| https://www.yobulk.dev/blog/Building%20an%20In-
| house%20CSV%2... to understand the CSV workflow problem. Happy
| to answer your queries.
| aarondia wrote:
| From the blog:
|
| > In a typical scenario, the customer success team who is
| in charge of this activity has to work back and forth with
| the customer. The customer has to resolve manual
| (unintended) errors.
|
| + 100. This is even the case when analysts are working with
| datasets that are created by their colleagues at the same
| company. Since most companies don't have clear standards
| for column header labelling, etc., getting a new dataset and
| incorporating it into an existing workflow requires
| collaboration with others from inside your company.
| yosai wrote:
| Thanks for resonating with the problem statement. Yes,
| internal teams also face the issue; you rely on the other
| team to clean the data. YoBulk is automating this workflow so
| that both data donor and receiver solve the data errors in a
| much more collaborative way.
| yosai wrote:
| @anoonmoose we have an internal pipeline which streams the
| MongoDB data to any CSV or any webhook URL path. It's an
| export pipeline which streams the processed data to CSVs. We
| will expose an API in the coming days which will fit your use
| case.
| breck wrote:
| It's still early, but TreeBase might be worth a look.
| (https://jtree.treenotation.org/treeBase/index.html)
|
| It's the public domain software that powers PLDB.com and
| CancerDB.com.
|
| You store your data in Tree Notation in plain text files, and
| use the Grammar Language (a Tree Language) for schemas, which
| also enforces correctness. You use Git for version control. You
| then can query the data using TQL (also a Tree Language). You
| can display your data using Scroll (also a Tree Language).
|
| So your data, your query language, your schemas, your display
| language are all in the same simple plain-text notation: Tree
| Notation. Of course, there's also a lot of Javascript glue.
|
| Very little documentation at the moment, and it's brand new,
| but it simply was not possible before the M1s, which came out
| in December 2020, and the growth rate is very good.
|
| It's all signal, no noise, so it's a timeless solution, and you
| won't regret putting your data in there.
| aarondia wrote:
| It's not an AI-based approach, but it is a step up from writing
| code by hand -- you could try using open source Mito ->
| https://www.trymito.io -> full disclosure I built it -> to do
| some of this messy data wrangling. Mito lets you view and
| manipulate your data in a spreadsheet in Jupyter and it
| generates the equivalent Python code for each edit. For things
| like identifying that the data uses '&' and 'and', viewing your
| data in a spreadsheet is >> just writing code.
|
| Once you generate the code, you could copy it into your
| pipeline so that you pull the data from the last.fm API,
| preprocess it with the Python code that Mito generated, and
| then dump it into the LiteDB.
| [deleted]
| WhiteNoiz3 wrote:
| When I read the headline I thought this would take a few rows of
| your CSV file and generate the schema from that using AI. Seems
| like you still need to manually describe the columns.
| yosai wrote:
| Yes, we have a workflow for your use case. YoBulk can create a
| template or schema by uploading a CSV; we read some lines and
| create the schema. Right now we have not added AI for that.
| This flow is very handy for use cases like when you want to
| upload a CSV file to HubSpot or LinkedIn and want to do the
| data cleaning according to the HubSpot or LinkedIn defined
| template. People can upload a LinkedIn or HubSpot template CSV
| to YoBulk and create a template, then validate their CSV data
| against the YoBulk template before uploading to the HubSpot or
| LinkedIn portal.
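|
| A very rough sketch of that kind of sample-based schema
| inference (hypothetical and heavily simplified, not YoBulk's
| actual engine):
|
|   // Guess a column type from a handful of sample values.
|   function inferType(samples) {
|     const vals = samples.filter((v) => v !== '' && v != null);
|     if (vals.length === 0) return 'string';
|     if (vals.every((v) => /^-?\d+$/.test(v))) return 'integer';
|     if (vals.every((v) => !isNaN(Number(v)))) return 'number';
|     if (vals.every((v) => !isNaN(Date.parse(v)))) return 'date';
|     return 'string';
|   }
|
|   // Build a minimal schema from the first few parsed rows.
|   function inferSchema(rows) {
|     const properties = {};
|     for (const col of Object.keys(rows[0] || {})) {
|       properties[col] = {
|         type: inferType(rows.map((r) => r[col])),
|       };
|     }
|     return { type: 'object', properties };
|   }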
| iLoveOncall wrote:
| This is a trivial problem you can solve in a hundred lines in
| any programming language, you don't need AI.
| blowski wrote:
| It's also something you can do manually with a few admin
| people, or perhaps using a COBOL script. Innovation means
| we'll have different ways of doing the same thing.
| jimlongton wrote:
| Does this send _all_ the data to a third party? What if it
| contains personal information?
| yosai wrote:
| No data is sent to any 3rd party. YoBulk is self-hosted; your
| personal information is stored in your own database. Feel free
| to ask any data security related questions.
| counttheforks wrote:
| How are you running GPT locally? OpenAI is a third party.
| wstuartcl wrote:
| From a cursory overview, it looks like the only thing OpenAI
| does is generate a schema from a specific hand-written
| prompt.
| yosai wrote:
| There are multiple use cases where we are using OpenAI.
| Example: matching an uploaded CSV's columns to template
| columns via string matching.
| jimlongton wrote:
| That makes sense. It would be good to add this
| information to the README.
| yosai wrote:
| Sure, we will add it to the README. We captured it in our
| documentation; please have a look:
| https://doc.yobulk.dev/YoBulk%20AI/AI%20usecases
| yosai wrote:
| We are using OpenAI only to create schemas, do column
| matching, and generate regexes. No CSV data is sent to
| OpenAI.
| nerdponx wrote:
| So you ask OpenAI to generate a schema and computer code
| that will clean input data to conform to that schema, and
| then run the user's data through that program? Is it
| possible for users to obtain and audit the generated code
| for correctness, performance, etc.? How do you prevent
| things like the AI from generating catastrophically-
| backtracking regex, O(N!) algorithms, or outright
| mistakes?
| yosai wrote:
| @nerdponx OpenAI only generates a validation schema. We have
| our own schema generation engine for any custom validation,
| which can be used where OpenAI is not able to understand or
| generate a correct schema. Yes, you are absolutely right:
| OpenAI's output is not always right. We have integrated a
| JSON parser which validates OpenAI's output, and we are
| currently developing a regex parser which validates OpenAI's
| output. Hope it answers your query. Happy to understand your
| pain point more.
| nerdponx wrote:
| Interesting, thanks for explaining how it works. Would it
| also be possible to construct one of these JSON schemas
| by hand, without OpenAI? The core data cleaning system
| sounds like just as interesting a piece of technology as
| the AI schema generation.
| yosai wrote:
| @nerdponx Yes, YoBulk provides a way to write JSON schemas
| by hand. We have added some custom keywords like "validate"
| which are not defined in the JSON Schema standard, but our
| validation engine can understand them. You can pass a
| JavaScript function through the validate keyword. It's a
| game changer.
|
|   "first_name": {
|     "type": "string",
|     "format": "first_name",
|     "validate": "(x) => {\r\n let regex = new RegExp(\"([a-z][a-zA-Z]*)\");\r\n return regex.test(x);\r\n }"
|   },
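|
| One hypothetical way an engine could evaluate such a
| stringified validator (illustrative only; the actual YoBulk
| mechanism may differ):
|
|   // Turn the stored string into a callable and run it on a cell.
|   const validateSrc =
|     '(x) => { let regex = new RegExp("([a-z][a-zA-Z]*)"); ' +
|     'return regex.test(x); }';
|   const toFn = (src) => new Function(`return (${src})`)();
|   const isValidFirstName = toFn(validateSrc);
|   isValidFirstName('alice'); // true
|   isValidFirstName('123');   // false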
| nerdponx wrote:
| This is a very interesting combination of technologies.
| Thank you again for explaining how they work! I tend to
| prefer open-source solutions for my own work, but I can
| see this being highly useful and valuable for many
| businesses.
| justeleblanc wrote:
| So what exactly is sent to OpenAI?
| yosai wrote:
| @justeleblanc we do auto column matching through OpenAI.
| Example: if you have defined a template with the column name
| "Date of Joining" and the CSV data donor uploads a CSV with a
| field "DOJ", then your validation engine would skip the import
| as it does not match the expected column names. Here OpenAI
| comes in handy: it gets the context and smartly identifies
| that DOJ is the same as Date of Joining and does the
| importing. You can go through
| https://doc.yobulk.dev/YoBulk%20AI/AI%20usecases to
| understand more on the AI use cases of YoBulk.
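|
| A rough sketch of what that kind of call can look like with
| the openai Node package (the prompt wording and model choice
| here are assumptions for illustration, not YoBulk's actual
| code):
|
|   const { Configuration, OpenAIApi } = require('openai');
|   const openai = new OpenAIApi(
|     new Configuration({ apiKey: process.env.OPENAI_API_KEY })
|   );
|
|   async function matchColumn(csvColumn, templateColumns) {
|     // Only column names are sent, never the CSV rows.
|     const prompt =
|       `Which of these template columns best matches ` +
|       `"${csvColumn}"?\n` +
|       `Template columns: ${templateColumns.join(', ')}\n` +
|       `Answer with the template column name only.`;
|     const res = await openai.createCompletion({
|       model: 'text-davinci-003',
|       prompt,
|       max_tokens: 20,
|       temperature: 0,
|     });
|     return res.data.choices[0].text.trim();
|   }
|
|   // matchColumn('EMP DOJ', ['EMP date of Joining', 'Salary'])
|   //   -> 'EMP date of Joining'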
| dang wrote:
| Sorry for the offtopicness but can you please email me at
| hn@ycombinator.com?
| cdolan wrote:
| Cheers to OP for whatever magic this interaction has
| unlocked!
| dang wrote:
| I resort to comments like that when I don't have another
| way of contacting the user. If it happens in a Show HN
| thread, it's probably that I want to ask if they've
| considered applying to YC with the project. People often
| underestimate what YC might be interested in. Open-source
| startups are a big part of what gets funded these days. I
| particularly love it when startups make it in to YC
| through HN--it's great for both YC and HN, and if a big
| success ever came out of it, it would easily fund HN for
| another century :)
|
| If I post a please-email-me outside of a Show HN thread,
| it might be anything! but the most common reason is that
| I want to send them a repost invite for some cool article
| they posted long ago. Invited reposts make it into the
| second-chance pool (see https://news.ycombinator.com/pool
| and https://news.ycombinator.com/item?id=26998308).
| l33t233372 wrote:
| I understand that this product does use OpenAI's API, but I
| just want to stress that OpenAI doesn't own GPT, it did not
| create transformer models, and GPT is not something usable
| exclusively through OpenAI.
|
| A priori, there's no reason they couldn't include their own
| GPT model that also lives on your own server.
| olalonde wrote:
| According to Wikipedia, OpenAI did create GPT[0] and the
| GPT-3 model[1], which this tool depends on, is
| exclusively available through the OpenAI API. It's true
| that some open alternatives to GPT-3 have been popping up
| but I believe they are still behind in terms of quality.
|
| [0] https://en.wikipedia.org/wiki/Generative_pre-
| trained_transfo...
|
| [1] https://en.wikipedia.org/wiki/GPT-3
| l33t233372 wrote:
| Thank you for adding a correction. I shouldn't have been
| so flippant: OpenAI did not invent transformers, GPT-3 is
| an iteration of the BERT architecture, which is an
| encoder, whereas GPT is a decoder model.
|
| Regardless, a GPT model is not a proprietary technology
| you must get from OpenAI. You're right that this tool
| uses OpenAI's API, but this isn't necessarily implied by
| the use of GPT.
| coolspot wrote:
| > Thank you for adding a correction. I shouldn't have
| been so flippant: OpenAI did not invent transformers
|
| BTW your answer structure is very similar to ChatGPT's
| when you point it to a mistake.
| l33t233372 wrote:
| Well to be fair GPT-3 was trained on internet comments!
| elevenoh wrote:
| [dead]
| yosai wrote:
| @l33t233372 you are absolutely spot on.
| [deleted]
| [deleted]
| yarapavan wrote:
| This looks good and promising! Congrats and best wishes, yosai!!
| yosai wrote:
| @yarapavan Thanks. Please explore the product!
| PaulHoule wrote:
| Around the time that BERT and fasttext were just coming out, I
| worked at a startup that had built a system that used text CNNs
| to interpret CSV files, particularly we had models that profiled
| at the level of individual cells by classifying either the
| content of the cell alone or the content of the cell plus the
| label of the column.
|
| I was thinking we were going to get bought by one of our big
| clients like Airbus or a major accounting firm but actually the
| firm got bought by a major shoe and clothing brand. I still wear
| swag from that employer to the gym sometimes so I like to think
| it got transmigrated when the acquisition happened.
| Der_Einzige wrote:
| My guess is Zalando, right? I was always intrigued at the
| quality of NLP research coming out of them!
| PaulHoule wrote:
| No, it was this
|
| https://fdra.org/latest-news/nike-acquires-data-
| integration-...
___________________________________________________________________
(page generated 2023-02-21 23:00 UTC)