[HN Gopher] Building a personal data warehouse in Snowflake for ...
___________________________________________________________________
Building a personal data warehouse in Snowflake for fun and no
profit
Author : thomasdziedzic
Score : 141 points
Date : 2021-01-31 11:24 UTC (11 hours ago)
(HTM) web link (www.thomasdziedzic0.com)
(TXT) w3m dump (www.thomasdziedzic0.com)
| alex_young wrote:
| "3,850 requests per second" to hacker news? Sounds like a denial
| of service attack ;)
| eat_veggies wrote:
| The hacker news API is served by firebase -- I'm sure google's
| infrastructure can handle that traffic
| MoOmer wrote:
| Agreed, it would have been nice to see some sort of delay or
| rate limit.
| thomasdziedzic wrote:
| Thanks for the suggestion. Added a rate_limit parameter to my
| script to play nicely with the api :)
| https://github.com/thomasdziedzic/hackernews-
| etl/commit/6dff...
| thomasdziedzic wrote:
| Ah thanks for mentioning this. I looked at the blog post and I
| seem to have prefixed an extra 1 to the number of requests done
| by my python script. I updated the post to include the actual
| calculation of 402 requests per second which is a lot less
| impressive but still much faster than my original approach on
| half the hardware.
| wallawaz wrote:
| Thanks for sharing! I've recently starting using Snowflake - it
| definitely has some nice features (cloning, storage integrations,
| etc). I'm interested what tooling you used to generate your
| visualizations. Were the charts generated directly from query
| output or did you need to load your query results into something
| like seaborn?
| thomasdziedzic wrote:
| Snowflake's preview app called Snowsight provides a way to both
| query and visualize the data.
| https://docs.snowflake.com/en/user-guide/ui-snowsight-visual...
| wallawaz wrote:
| Thanks!
| mpeteuil wrote:
| Nice post! For your "How popular is investing in the community?"
| section, I would try to adjust that so it's not an absolute
| number. Right now you don't know whether "investing" seems more
| popular just because there are more users or stories overall or
| because a higher proportion of the items submitted are about
| investing. For example, instead of the raw number of stories that
| month with "investing" in them, look at something like that
| number divided by the total number of stories for that month,
| which would give you the fraction of stories that month that
| contained the term "investing".
| asah wrote:
| nice.
|
| FYI that stock regexp is buggy and e.g. will match $42.36 which
| obviously isn't a symbol.
|
| Indeed, symbols are tough enough to nail correctly e.g. people
| not including the $, symbols with periods, etc.
|
| Offhand, I'd build a little neural net classifier (e.g.
| https://fasttext.cc/ ) and train this on a slew of example-posts
| that are/aren't about stonks. To get training data, use regexps
| and then run through them by hand (20+ per minute per hour =
| 1200/hour, or outsource to amazon mturk $0.25 per 10, incl
| verification = $30/1200). Also, there's probably easy ones you
| can classify 100% correctly with regexps, to increase the
| training set size.
|
| I'm happy to help if you like.
| osintdude wrote:
| nice
| dvfjsdhgfv wrote:
| > you will need to add the following line to limits.conf and then
| reboot the machine
|
| Is this normal in AWS? On most systems I worked with it's not
| true, you just need to start a new session - a reboot is not
| necessary.
| bthomas wrote:
| Off topic: what do you use for CMS? I like how the code blocks
| and graphs are integrated. Great post!
| thomasdziedzic wrote:
| I use Squarespace to host my site.
|
| Unfortunately it doesn't have native support for code
| highlighting besides html/css/js. I ended up using prism.js
| https://prismjs.com/ to highlight my code. The graphs are just
| screenshots that I've uploaded as pictures.
| indrayam wrote:
| As someone who never thought of using Snowflake for personal use,
| what was the cost of setting up and using Snowflake? Did you use
| Snowflake in AWS? FWIW: I am super new to the Snowflake ecosystem
| thomasdziedzic wrote:
| This is a great question and something that I probably should
| have at least mentioned in the blog post. I'm currently using
| the free trial since I signed up for it around Jan 15th. I am
| using the standard plan on AWS (US East Ohio).
|
| Snowflake charges $2 per credit in compute costs on their
| standard plan and $40/TB/month in storage. See
| https://www.snowflake.com/pricing/ for more details. Since the
| beginning of my trial, I have used 9 credits and 40GB of
| storage. So I have used about (9 * 2) + ((40 / 1000) * 40) =
| $19.60 worth of resources so far. This isn't terrible but I
| must caution you that Snowflake costs could balloon pretty
| fast. That's why I also set up a resource monitor on my account
| to stop all warehouses when the number of credits reaches 20.
| https://docs.snowflake.com/en/user-guide/resource-monitors.h...
|
| I would recommend if you were going to go my route, to setup
| resource monitors, and only use XSMALL warehouses which consume
| 1 credit / hour. Snowflake sizes warehouses in t-shirt sizes,
| and the credits consumed doubles each size up. So XSMALL = 1
| credit per hour, SMALL = 2, MEDIUM = 4, LARGE = 8, XLARGE = 16,
| 2XLARGE = 32, 3XLARGE = 64 and 4XLARGE = 128.
| rsync wrote:
| Question for mods/dang/etc.: How interesting was it that a third
| party hit HN @ 300 requests per second for a total of 25M hits ?
| konha wrote:
| The requests went to hacker-news.firebaseio.com which I believe
| is hosted by GCP, separate from HN's infrastructure.
| enz wrote:
| Great use-cases for SQL window functions!
| dm13450 wrote:
| Minor point, but when calculating the avg_diff_price for $GME you
| should be calculating a return (close-open)/open otherwise days
| where the stock went from 100 to 105 (5% increase) look the same
| as days when it went from 5 to 10 (100% increase).
|
| Likewise, when calculating the correlations, that should be done
| on returns and not prices.
| qeternity wrote:
| I'm not sure that's a given. They are different measures, but
| both completely valid. A return might be more meaningful to a
| long term investor, but ultimately every other metric and
| participant is interested in the absolute change: traders to
| calculate pnl/greeks/etc, exchanges to match orders, clearers
| to calculate margin.
|
| If I am wearing 100 shares, whether it went from $5 to $10 or
| $100 to $105, it's the same pnl.
| xhedley wrote:
| First question posed to data set: when is the best time to post
| to Hacker News to maximise expected page rank?
|
| Given you posted at your calculated time of 11:00 UTC on a
| Sunday, will be interesting to see how it does. It's at 13 right
| now.
| vxNsr wrote:
| I think the content and title also matter. I didn't know what
| snowflake was until I read the blog, but the title's play on
| "... for fun and profit" got me to at least check it out.
| leetrout wrote:
| Question to the author and anyone else here:
|
| Do you find the advice to reboot the VM after making a security
| change useful?
|
| It will certainly work but from my understanding it should be
| able to change that setting and alter the users login such that
| you only need to logout and back in. Maybe I am mistaken?
|
| I try very hard to avoid reboots - it may be distracting for a
| focused article like this - so curious what others think.
| daniellarusso wrote:
| Are you asking about the max number of open files allowed?
|
| That was why the reboot was needed.
| leetrout wrote:
| You shouldn't need to reboot the system for settings to take
| effect - pretty much everything except some kernel upgrades
| can be done without a reboot.
| simonw wrote:
| My rule of thumb is to reboot instances after making changes
| mainly so you can be confident that the machine will continue
| to work correctly after the reboot, and you'll still be able to
| login.
|
| This is based on past experiences where I've configured a VM,
| then a year later needed to make changes but been terrified of
| a reboot because I can't remember if it's likely to keep
| working or not!
| lostapathy wrote:
| Been there done that! If you do end up with a machine that's
| non-bootable or seriously impaired, it's a lot easier to
| figure out when it's "just" today's update in question and
| not a year's worth.
| Damogran6 wrote:
| I think it's a holdover from older times...and underscores a
| resiliency thing. Rebooting forces garbage collection, and
| being able to withstand a reboot with no loss of service builds
| geek cred. :D
| thomasdziedzic wrote:
| According to https://unix.stackexchange.com/a/108605 it looks
| like only a logout and login was required for the changes to
| take effect. I didn't know this at the time which is why I
| initiated a reboot.
| fuy wrote:
| Interesting read, thanks! Regarding generating missing ids (quote
| from the blogpost: "If you know how one would generate the
| missing ids between the gaps (which could be of variable size)":
| I don't have access to snowflake instance, but the following
| works in Postgres (if I understood the problem correctly):
| with lead as ( select id, lead(id, 1) over (order by
| id) as lead_by_one from gaps_table), gaps as (
| select id + 1 as start_gap, lead_by_one - 1 as end_gap
| from lead where lead_by_one - id > 1) select
| generate_series(start_gap, end_gap) as missing_ids from gaps
|
| So if Snowflake has a generating function similar to
| generate_series, it should do the trick.
___________________________________________________________________
(page generated 2021-01-31 23:01 UTC)