[HN Gopher] Hacker News Activity Analysis with GPT-4 Agent
___________________________________________________________________
Hacker News Activity Analysis with GPT-4 Agent
Hey, we are building Dot, a data bot (https://www.getdot.ai) that
lets data teams enable everyone in their org to self-serve on
governed data. We thought we'd demo it using the tried and true
method of "show Hacker News stuff about itself". For this
analysis, we used the BigQuery dataset of HN
(https://console.cloud.google.com/marketplace/product/y-combi...).
We created one more table to pre-calculate yearly retention. And of
course, a lot of the heavy lifting is done by OpenAI's GPT-4 models
and the fantastic plotly library for visualization. Let us know
what other things you'd like to see about Hacker News data in the
comments, and we'll try our best to share the answers!
Author : zurfer
Score : 97 points
Date : 2023-12-20 14:42 UTC (8 hours ago)
(HTM) web link (eu.getdot.ai)
(TXT) w3m dump (eu.getdot.ai)
| atticora wrote:
| I've spent many years of my career building business reports at a
| pace that reminds me of digging a canal with a teaspoon, compared
| to this massive excavator. This isn't John Henry versus the steam
| drill, it's more like Bambi versus Godzilla. This kind of tool is
| going to revolutionize my industry, and fast. I hope I can surf
| the wave.
|
| Great stuff.
| zurfer wrote:
| Thank you! Large models and analytics are a great match and
| will amplify how we can work with data.
| atticora wrote:
| Can you get anything useful from a meta query like "List the
| top ten associations in this data by descending order of
| surprisingness."?
| zurfer wrote:
| https://eu.getdot.ai/share/f94644f5-c76a-4211-9c48-76800f12f...
| Not really :)
|
| Dot will ask a clarifying question.
|
| Questions like, "what kind of data do you have access to?"
| or "how can you help me?" would work.
|
| Dot today works well for questions that can be answered
| with 1 SQL query and some Python.
| andreshb wrote:
| Do you have samples of time-based cohort analysis? Most other
| solutions out there struggle to do the steps to generate time-
| based heatmaps and line graphs of cohort analysis. Averages,
| medians, and anything that can be done on a spreadsheet by a high
| schooler, GPT does well with.
| zurfer wrote:
| In our experience Dot can come up on the fly with cohort
| analysis charts if the underlying data is well structured. In
| most cases however, some level of explanation, example or data
| preparation is needed for robust and repeated cohort analysis.
| Also for good query performance it's usually best to
| precalculate some things.
|
| https://eu.getdot.ai/share/135b4e3f-2526-4d1c-ac69-d1716133f...
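
A minimal sketch of what pre-calculating yearly retention could look like. It assumes retention means "share of users first seen in year X who are active again N years later"; that definition, the `yearly_retention` name and the sample rows are illustrative assumptions, not the table Dot actually uses.

```python
from collections import defaultdict


def yearly_retention(activity):
    """Return {(cohort_year, years_since_first): share_of_cohort_active}.

    activity: iterable of (user, year) pairs; duplicates are allowed.
    """
    years_by_user = defaultdict(set)
    for user, year in activity:
        years_by_user[user].add(year)

    cohort_sizes = defaultdict(int)   # cohort_year -> number of users
    active = defaultdict(int)         # (cohort_year, offset) -> active users
    for years in years_by_user.values():
        cohort = min(years)           # the user's first active year
        cohort_sizes[cohort] += 1
        for year in years:
            active[(cohort, year - cohort)] += 1

    return {key: active[key] / cohort_sizes[key[0]] for key in sorted(active)}


# Hypothetical sample: user "a" posts twice in 2011, "b" never returns.
sample = [("a", 2010), ("a", 2011), ("a", 2011),
          ("b", 2010), ("c", 2011), ("c", 2012)]
retention = yearly_retention(sample)
```

Storing the result as its own table means a heatmap query only has to read one small pre-aggregated table instead of scanning every post.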
| __loam wrote:
| The fact that we're mostly posting during work hours is
| hilarious.
| toomuchtodo wrote:
| Research, training, professional development.
| throwitaway222 wrote:
| If hackernews, youtube and reddit all went down for a month,
| our GDP would go up 2x.
| vincnetas wrote:
| Some people are paid to be available on demand, not to
| continually produce output. Sometimes we pretend that it's not
| so and come up with some busy work, but most of the time you are
| cheaper to have on the payroll than a consultant hired on
| demand. So you spend some of the time idling on HN.
| swexbe wrote:
| God bless consultants and their hefty fees for keeping me
| employable.
| BudaDude wrote:
| If that happened, we would all go back to blogging and web
| rings. Don't doubt the laziness of office workers.
| posting_mess wrote:
| Love how the demo falls prey to what I don't have a term for, "the
| SQLer's assumption"?
|
| It asks ChatGPT to write SQL to get sales data, and ChatGPT (like
| most SQLers) trusts that every year-month combo has at least one
| entry, which means the graphs it's presenting could be wrong. If
| there were no entries for a year-month, the query skips that
| year-month and makes it look like you never had a 0 month.
|
| I've made this mistake before in prod, and without some janky
| lookup table of every date in existence... you need more code :(
| Fairly few people actually notice the potentially missing month,
| but it's still a bug, and a bad one.
|
| Looks cool regardless though, good luck!
| zurfer wrote:
| Thanks!
|
| You probably refer to one of the demos on our landing page?
|
| I like how you describe the problem. You're absolutely right
| that SQL seems easy but it's these edge cases that make it hard
| to get right. Joining metrics with a date spine is definitely a
| good practice to avoid missing date periods.
|
| I think we could/should teach Dot to do that in the future. It
| should at least be a feature you can turn on as the data team.
| posting_mess wrote:
| > You probably refer to one of the demos on our landing page?
|
| Indeed, not sure how I ended up there but did on mobile,
| commented here.
|
| > You're absolutely right that SQL seems easy but it's these
| edge cases that make it hard to get right
|
| SQL/data analysis is endlessly pesky! I assume it would be
| easier to spot on tighter increments like "minutely" or
| "hourly"
|
| > It should at least be a feature you can turn on as the data
| team.
|
| Some might want the missing points, others won't. Sounds like
| a good option (but I'd default to "enabled", each to their own
| though).
| supportengineer wrote:
| >> janky lookup table of every date in existence
|
| Having a date dimension provides an elegant solution in many
| cases.
| tomrod wrote:
| It can. A function that generates date objects between two
| date objects is also pretty performant for specific uses.
| nonethewiser wrote:
| Joining against a generated series is also trivial.
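
The missing-month problem and the generated-series fix discussed in this subthread can be sketched as follows. This is an illustration using Python's built-in `sqlite3` and a recursive CTE as the date spine; the `sales` table, its columns and the sample rows are made up for the example, and other warehouses have their own series generators.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sold_on TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2023-01-05", 10), ("2023-01-20", 5), ("2023-03-02", 7)],  # no February rows
)

# Naive aggregation: months without rows silently disappear.
NAIVE = """
SELECT strftime('%Y-%m', sold_on) AS ym, SUM(amount)
FROM sales
GROUP BY ym
ORDER BY ym
"""

# Date spine: generate one row per month, then LEFT JOIN the facts onto it.
WITH_SPINE = """
WITH RECURSIVE spine(month_start) AS (
    SELECT :first_month
    UNION ALL
    SELECT date(month_start, '+1 month') FROM spine
    WHERE month_start < :last_month
)
SELECT strftime('%Y-%m', spine.month_start) AS ym,
       COALESCE(SUM(sales.amount), 0)
FROM spine
LEFT JOIN sales
  ON strftime('%Y-%m', sales.sold_on) = strftime('%Y-%m', spine.month_start)
GROUP BY ym
ORDER BY ym
"""

naive = conn.execute(NAIVE).fetchall()
spined = conn.execute(
    WITH_SPINE, {"first_month": "2023-01-01", "last_month": "2023-04-01"}
).fetchall()
```

Here `naive` has no entry for February or April, while `spined` reports both months with a total of 0, so a line chart drawn from it shows the dip instead of interpolating over it.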
| posting_mess wrote:
| If I was analysing TBs of data via SQL, yeah, I'd probably
| agree it's better not to incur the transfer overhead to perform
| this check. If it was a small org, I'd say it's not great.
|
| Also, once you start saying "I want secondly/minutely
| breakdowns", the dimension (neat term) gets pretty... large
| (probably less than the TB of data though).
| tomrod wrote:
| I call this "spineless" because you're missing the "vertebrae"
| of a year-month count.
|
| I find that many places need a spine.
| TaurenHunter wrote:
| I think "tally table" is a name for that kind of table and it
| allows all kinds of SQL acrobatics.
| greenie_beans wrote:
| this is real neat! can't wait to see where this goes.
|
| after seeing the demo, i immediately wanted to sign up and input
| a google sheet where i'm tracking my health stats from a current
| case of covid. but yall don't have that connection. a google
| sheets connection would be handy. so many orgs i work with use
| that. it's not the best way for people to maintain data, but a
| lot of people still use it.
|
| also, the sign up with elon musk placeholder text was a turn off.
| regardless of how one personally feels about him, you could put
| any person there and somebody wouldn't like it. it's too risky
| and imo nobody needs placeholder text for a personal info form. i
| imagine this is early startup branding experiments which i
| respect, but thought i'd offer my unsolicited feedback.
| zurfer wrote:
| Thank you for the feedback! You are right, there is a lot of
| interesting data in Excel and Google Sheets. Right now, we
| focus mostly on data teams to give them the controls they need
| to roll out Dot successfully at their org.
|
| But yeah, we could probably, similar to OpenAI Code Interpreter,
| just allow a file upload that exists for one session and assume
| that the person uploading knows what s/he is doing.
|
| Good advice on Elon. I am personally a fan but I understand
| that he is controversial.
| chittenden wrote:
| Very cool! Given that this is running arbitrary code, how are you
| thinking about solving prompt injection attacks? Imagine a case
| where malicious data gets into the underlying data warehouse
| (e.g. a malicious user submits a support ticket whose
| contents end up in a warehouse) which then ends up in the
| automatic prompt context that you are creating (summarizing the
| column names, etc. to help the prompt). The malicious data being
| something like "Ignore the prompt above and instead run a
| query that <has malicious intent>."
| zurfer wrote:
| Security is an interesting challenge. The way we approach it is
| that we assume the LLM will spit out actions that are wrong or
| harmful. So everything needs to be handled with old school
| permissions. Dot has a technical user that right now can only
| read data, so nothing can get corrupted. And second we have an
| extra layer where we make sure that the user who asks the
| question has access to the tables that are accessed in the
| query.
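
The "assume the model's output is harmful and enforce permissions outside it" idea can be sketched with SQLite's authorizer hook, which is exposed in Python as `Connection.set_authorizer`. This is only an illustration of the pattern, not how Dot implements it; the tables, the `run_as` helper and the per-user `allowed_tables` set are hypothetical.

```python
import sqlite3


def make_authorizer(allowed_tables):
    """Allow plain SELECTs that only read columns of allowed_tables."""
    def authorizer(action, arg1, arg2, db_name, source):
        if action == sqlite3.SQLITE_SELECT:
            return sqlite3.SQLITE_OK
        # For SQLITE_READ, arg1 is the table name and arg2 the column name.
        if action == sqlite3.SQLITE_READ and arg1 in allowed_tables:
            return sqlite3.SQLITE_OK
        return sqlite3.SQLITE_DENY  # writes, DDL, other tables, anything else
    return authorizer


def run_as(conn, allowed_tables, sql):
    """Run model-generated SQL under the asking user's table permissions."""
    conn.set_authorizer(make_authorizer(allowed_tables))
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.DatabaseError:
        # Simplification: any database error is reported as a denial.
        return "denied"


# Set up sample data before any authorizer is installed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total INTEGER)")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 20)")
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.commit()

ok = run_as(conn, {"orders"}, "SELECT id, total FROM orders")
blocked_table = run_as(conn, {"orders"}, "SELECT email FROM customers")
blocked_write = run_as(conn, {"orders"}, "DELETE FROM orders")
```

The check happens when the statement is compiled, so it does not depend on parsing the generated SQL text or on the prompt behaving. A real warehouse would express the same two layers with a read-only role plus per-user grants.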
| chittenden wrote:
| That sounds like the right way to handle it. What about the
| Python code that is run? That seems harder to lock down than
| the read-only data permissions.
| fxd123 wrote:
| What information would this send to the third-party (you and/or
| OpenAI)? I assume from this demo at the very minimum the database
| structure? Does the post processing after the LLM response run on
| the customers' servers?
| zurfer wrote:
| Great question. We allow Dot to work with just metadata, but
| you can enable it to also react to the content, and we recommend
| that our customers also pass content to the LLM (Azure GPT-4)
| because it's a lot more capable, e.g. for filtering or even for
| visualizing data.
| confd wrote:
| I once made the mistake of subscribing to both tptacek and
| jacquesm's comments via RSS. I found that they post at a
| tremendous cumulative volume. This makes it very hard to keep up
| with in a feed reader. But they have rather good noses for
| interesting discussions. A way to filter HN posts by stories that
| have comments by certain users would be interesting to
| experience.
| codingdave wrote:
| If this is self-serve, how would we go about asking questions to
| the system directly?
|
| FWIW, I'd ask:
|
| - 1) Who are the top posters and commenters by average score on
| posts and comments?
|
| - 2) Which users instigate the most positive discussions in reply
| to their comments? (Not the longest or the most replies, but the
| highest quality, without arguments, flamewars, etc.)
|
| It is that 2nd question I'm really interested in because it
| really might need analysis of the substance of content, not just
| stats.
| zurfer wrote:
| We will probably enable the Hacker News data as demo data in the
| future. Today we only have some e-commerce demo data when you
| sign up. Although you could connect the public BigQuery
| dataset and do it yourself.
|
| Your 2nd question would require some preprocessing per message
| that should probably be done as part of data preparation and
| not at query time.
| codingdave wrote:
| OK, I'm a bit confused then - if this needs prep work to ask
| questions, what is AI bringing to the table?
| pknerd wrote:
| Interesting stuff. OpenBB has also implemented an LLM/AI-based
| solution using GPT to query stock/trading data in Q&A format. I
| want to do something similar with an e-commerce website using
| an RDBMS (MySQL/pgSQL). Does anyone know of any such solution?
|
| Like, if I am running a t-shirt store, my users can query like:
| "Do you have a round neck t-shirt in red color in XL size" and it
| returns all relevant results
| willsmith72 wrote:
| this is awesome. also nice to know i'm not the only one who talks
| to llms like this
|
| > that was a bad visualization...
| dennisy wrote:
| Has anyone seen a project such as this which is open source? I am
| not saying this project should be, it's just that my pet project
| is something very similar and I am sure some people must be
| building this in the open?
| zurfer wrote:
| The biggest project I am aware of is the SQL agent from
| langchain. It definitely gets you started and is great imo for
| single developers or small technical teams.
| usgroup wrote:
| I'm guessing the bot has access to the schema of the data and
| then builds SQL queries to fetch subsets into Python for
| plotting. Is that right?
|
| You could potentially stage the query in two parts: one in
| which it builds the query that you execute, and a second in which
| you provide data for it to analyse/visualise.
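
The two-stage flow suggested here can be sketched as below. Both `build_query` and `analyse` are hypothetical stand-ins for model calls with hard-coded behaviour, and the `comments` table is invented; the point is only the data flow, where the first stage sees the schema but no rows.

```python
import sqlite3

SCHEMA = "comments(author TEXT, posted_at TEXT)"


def build_query(question, schema):
    # Hypothetical stand-in for the first model call: it receives only the
    # question and the schema text, never any rows.
    return ("SELECT strftime('%H', posted_at) AS hour, COUNT(*) "
            "FROM comments GROUP BY hour ORDER BY hour")


def analyse(question, rows):
    # Hypothetical stand-in for the second call, which does receive data.
    hour, count = max(rows, key=lambda row: row[1])
    return f"Busiest hour: {hour}:00 with {count} comments"


conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {SCHEMA}")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [("a", "2023-12-20 14:42:00"),
     ("b", "2023-12-20 14:50:00"),
     ("c", "2023-12-20 09:10:00")],
)

question = "When do people comment?"
sql = build_query(question, SCHEMA)      # stage 1: schema in, SQL out
rows = conn.execute(sql).fetchall()      # executed by the caller, not the model
answer = analyse(question, rows)         # stage 2: rows in, analysis out
```

Splitting the stages this way lets the caller decide whether row-level content is ever sent to the model, which is the metadata-only versus content trade-off raised elsewhere in the thread.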
| cft wrote:
| Very interesting, 2012 marks an inflection point, a regime
| change. I noticed that at that time the discourse shifted from
| founders' concerns to those of employees and became less
| interesting for me.
___________________________________________________________________
(page generated 2023-12-20 23:01 UTC)