[HN Gopher] Building a personal data warehouse in Snowflake for ...
       ___________________________________________________________________
        
       Building a personal data warehouse in Snowflake for fun and no
       profit
        
       Author : thomasdziedzic
       Score  : 141 points
       Date   : 2021-01-31 11:24 UTC (11 hours ago)
        
 (HTM) web link (www.thomasdziedzic0.com)
 (TXT) w3m dump (www.thomasdziedzic0.com)
        
       | alex_young wrote:
       | "3,850 requests per second" to hacker news? Sounds like a denial
       | of service attack ;)
        
         | eat_veggies wrote:
         | The hacker news API is served by firebase -- I'm sure google's
         | infrastructure can handle that traffic
        
         | MoOmer wrote:
         | Agreed, it would have been nice to see some sort of delay or
         | rate limit.
        
           | thomasdziedzic wrote:
           | Thanks for the suggestion. Added a rate_limit parameter to my
           | script to play nicely with the api :)
           | https://github.com/thomasdziedzic/hackernews-
           | etl/commit/6dff...
        
         | thomasdziedzic wrote:
         | Ah thanks for mentioning this. I looked at the blog post and I
         | seem to have prefixed an extra 1 to the number of requests done
         | by my python script. I updated the post to include the actual
         | calculation of 402 requests per second which is a lot less
         | impressive but still much faster than my original approach on
         | half the hardware.
        
       | wallawaz wrote:
       | Thanks for sharing! I've recently starting using Snowflake - it
       | definitely has some nice features (cloning, storage integrations,
       | etc). I'm interested what tooling you used to generate your
       | visualizations. Were the charts generated directly from query
       | output or did you need to load your query results into something
       | like seaborn?
        
         | thomasdziedzic wrote:
         | Snowflake's preview app called Snowsight provides a way to both
         | query and visualize the data.
         | https://docs.snowflake.com/en/user-guide/ui-snowsight-visual...
        
           | wallawaz wrote:
           | Thanks!
        
       | mpeteuil wrote:
       | Nice post! For your "How popular is investing in the community?"
       | section, I would try to adjust that so it's not an absolute
       | number. Right now you don't know whether "investing" seems more
       | popular just because there are more users or stories overall or
       | because a higher proportion of the items submitted are about
       | investing. For example, instead of the raw number of stories that
       | month with "investing" in them, look at something like that
       | number divided by the total number of stories for that month,
       | which would give you the fraction of stories that month that
       | contained the term "investing".
        
       | asah wrote:
       | nice.
       | 
       | FYI that stock regexp is buggy and e.g. will match $42.36 which
       | obviously isn't a symbol.
       | 
       | Indeed, symbols are tough enough to nail correctly e.g. people
       | not including the $, symbols with periods, etc.
       | 
       | Offhand, I'd build a little neural net classifier (e.g.
       | https://fasttext.cc/ ) and train this on a slew of example-posts
       | that are/aren't about stonks. To get training data, use regexps
       | and then run through them by hand (20+ per minute per hour =
       | 1200/hour, or outsource to amazon mturk $0.25 per 10, incl
       | verification = $30/1200). Also, there's probably easy ones you
       | can classify 100% correctly with regexps, to increase the
       | training set size.
       | 
       | I'm happy to help if you like.
        
       | osintdude wrote:
       | nice
        
       | dvfjsdhgfv wrote:
       | > you will need to add the following line to limits.conf and then
       | reboot the machine
       | 
       | Is this normal in AWS? On most systems I worked with it's not
       | true, you just need to start a new session - a reboot is not
       | necessary.
        
       | bthomas wrote:
       | Off topic: what do you use for CMS? I like how the code blocks
       | and graphs are integrated. Great post!
        
         | thomasdziedzic wrote:
         | I use Squarespace to host my site.
         | 
         | Unfortunately it doesn't have native support for code
         | highlighting besides html/css/js. I ended up using prism.js
         | https://prismjs.com/ to highlight my code. The graphs are just
         | screenshots that I've uploaded as pictures.
        
       | indrayam wrote:
       | As someone who never thought of using Snowflake for personal use,
       | what was the cost of setting up and using Snowflake? Did you use
       | Snowflake in AWS? FWIW: I am super new to the Snowflake ecosystem
        
         | thomasdziedzic wrote:
         | This is a great question and something that I probably should
         | have at least mentioned in the blog post. I'm currently using
         | the free trial since I signed up for it around Jan 15th. I am
         | using the standard plan on AWS (US East Ohio).
         | 
         | Snowflake charges $2 per credit in compute costs on their
         | standard plan and $40/TB/month in storage. See
         | https://www.snowflake.com/pricing/ for more details. Since the
         | beginning of my trial, I have used 9 credits and 40GB of
         | storage. So I have used about (9 * 2) + ((40 / 1000) * 40) =
         | $19.60 worth of resources so far. This isn't terrible but I
         | must caution you that Snowflake costs could balloon pretty
         | fast. That's why I also set up a resource monitor on my account
         | to stop all warehouses when the number of credits reaches 20.
         | https://docs.snowflake.com/en/user-guide/resource-monitors.h...
         | 
         | I would recommend if you were going to go my route, to setup
         | resource monitors, and only use XSMALL warehouses which consume
         | 1 credit / hour. Snowflake sizes warehouses in t-shirt sizes,
         | and the credits consumed doubles each size up. So XSMALL = 1
         | credit per hour, SMALL = 2, MEDIUM = 4, LARGE = 8, XLARGE = 16,
         | 2XLARGE = 32, 3XLARGE = 64 and 4XLARGE = 128.
        
       | rsync wrote:
       | Question for mods/dang/etc.: How interesting was it that a third
       | party hit HN @ 300 requests per second for a total of 25M hits ?
        
         | konha wrote:
         | The requests went to hacker-news.firebaseio.com which I believe
         | is hosted by GCP, separate from HN's infrastructure.
        
       | enz wrote:
       | Great use-cases for SQL window functions!
        
       | dm13450 wrote:
       | Minor point, but when calculating the avg_diff_price for $GME you
       | should be calculating a return (close-open)/open otherwise days
       | where the stock went from 100 to 105 (5% increase) look the same
       | as days when it went from 5 to 10 (100% increase).
       | 
       | Likewise, when calculating the correlations, that should be done
       | on returns and not prices.
        
         | qeternity wrote:
         | I'm not sure that's a given. They are different measures, but
         | both completely valid. A return might be more meaningful to a
         | long term investor, but ultimately every other metric and
         | participant is interested in the absolute change: traders to
         | calculate pnl/greeks/etc, exchanges to match orders, clearers
         | to calculate margin.
         | 
         | If I am wearing 100 shares, whether it went from $5 to $10 or
         | $100 to $105, it's the same pnl.
        
       | xhedley wrote:
       | First question posed to data set: when is the best time to post
       | to Hacker News to maximise expected page rank?
       | 
       | Given you posted at your calculated time of 11:00 UTC on a
       | Sunday, will be interesting to see how it does. It's at 13 right
       | now.
        
         | vxNsr wrote:
         | I think the content and title also matter. I didn't know what
         | snowflake was until I read the blog, but the title's play on
         | "... for fun and profit" got me to at least check it out.
        
       | leetrout wrote:
       | Question to the author and anyone else here:
       | 
       | Do you find the advice to reboot the VM after making a security
       | change useful?
       | 
       | It will certainly work but from my understanding it should be
       | able to change that setting and alter the users login such that
       | you only need to logout and back in. Maybe I am mistaken?
       | 
       | I try very hard to avoid reboots - it may be distracting for a
       | focused article like this - so curious what others think.
        
         | daniellarusso wrote:
         | Are you asking about the max number of open files allowed?
         | 
         | That was why the reboot was needed.
        
           | leetrout wrote:
           | You shouldn't need to reboot the system for settings to take
           | effect - pretty much everything except some kernel upgrades
           | can be done without a reboot.
        
         | simonw wrote:
         | My rule of thumb is to reboot instances after making changes
         | mainly so you can be confident that the machine will continue
         | to work correctly after the reboot, and you'll still be able to
         | login.
         | 
         | This is based on past experiences where I've configured a VM,
         | then a year later needed to make changes but been terrified of
         | a reboot because I can't remember if it's likely to keep
         | working or not!
        
           | lostapathy wrote:
           | Been there done that! If you do end up with a machine that's
           | non-bootable or seriously impaired, it's a lot easier to
           | figure out when it's "just" today's update in question and
           | not a year's worth.
        
         | Damogran6 wrote:
         | I think it's a holdover from older times...and underscores a
         | resiliency thing. Rebooting forces garbage collection, and
         | being able to withstand a reboot with no loss of service builds
         | geek cred. :D
        
         | thomasdziedzic wrote:
         | According to https://unix.stackexchange.com/a/108605 it looks
         | like only a logout and login was required for the changes to
         | take effect. I didn't know this at the time which is why I
         | initiated a reboot.
        
       | fuy wrote:
       | Interesting read, thanks! Regarding generating missing ids (quote
       | from the blogpost: "If you know how one would generate the
       | missing ids between the gaps (which could be of variable size)":
       | I don't have access to snowflake instance, but the following
       | works in Postgres (if I understood the problem correctly):
       | with lead as (           select id, lead(id, 1) over (order by
       | id) as lead_by_one            from gaps_table),        gaps as (
       | select id + 1 as start_gap,            lead_by_one - 1 as end_gap
       | from lead where lead_by_one - id > 1)        select
       | generate_series(start_gap, end_gap) as missing_ids from gaps
       | 
       | So if Snowflake has a generating function similar to
       | generate_series, it should do the trick.
        
       ___________________________________________________________________
       (page generated 2021-01-31 23:01 UTC)