[HN Gopher] Lessons Learned from Two Years as a Data Scientist
       ___________________________________________________________________
        
       Lessons Learned from Two Years as a Data Scientist
        
       Author : jessedrain
       Score  : 88 points
       Date   : 2021-09-02 04:34 UTC (2 days ago)
        
 (HTM) web link (dawndrain.github.io)
 (TXT) w3m dump (dawndrain.github.io)
        
       | jupin wrote:
       | I love the tip about using the python debugger "pdb". This
       | reminds me of the similar Ansible debug feature (e.g. "debugger:
       | on_failed") which lets you jump into an in-flight playbook.
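
A minimal sketch of the pdb tip, using only the standard library. It
shows the post-mortem pattern: when a call fails, the debugger opens at
the failing frame. The names `risky_divide`, `run`, and the `debug` flag
are hypothetical and exist only for illustration.

```python
import pdb
import sys


def risky_divide(a, b):
    """Hypothetical function that raises ZeroDivisionError when b == 0."""
    return a / b


def run(a, b, debug=False):
    """Call risky_divide; on failure, optionally open pdb at the failing frame."""
    try:
        return risky_divide(a, b)
    except ZeroDivisionError:
        if debug:
            # Opens an interactive debugger on the traceback of the
            # exception currently being handled.
            pdb.post_mortem(sys.exc_info()[2])
        return None
```

Calling `run(1, 0, debug=True)` from a terminal should drop into the
`(Pdb)` prompt inside `risky_divide`, where `a` and `b` can be inspected.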
        
         | Zababa wrote:
         | I remember using OCaml a bit and it has a time-travelling
         | debugger. You launch the program with the debugger, it crashes
         | and then you can inspect the program as you wish. I was really
         | impressed by this.
        
       | throwaway98797 wrote:
       | Good stuff, except:
       | 
       | <<I've put a large chunk of my money in leveraged index funds and
       | etfs>>
       | 
       | This is dangerous advice if you go all in on this.
        
         | bt3 wrote:
         | SPXL also has an expense ratio over 1%, which will eat away at
         | earnings unless in the best of bull rushes (now).
        
           | chriak8292 wrote:
           | Leveraged ETFs outperform VTI/VOO (in terms of total return)
           | over 30-40 year investment horizons. Period.
           | 
           | Now, the risk (potential one-year downside) is not for
           | everyone.
        
             | throwaway98797 wrote:
             | A little bit of leverage as others have commented is fine.
             | 
             | The problem happens if there's a 51% drop in a 2x levered fund.
             | 
             | There's a reason the fund the article's OP is in started in
             | 2008 and not 40 years ago.
        
         | pja wrote:
         | I believe the available evidence suggests that over the long
         | term, a small amount of leverage increases returns. Obviously
         | it increases volatility as well, but if your time horizon is
         | long, you can cope with that.
         | 
         | You do not want to invest in ETFs that are themselves leveraged
         | & rebalance daily however, that's going to eat all your money
         | if you hold them for any length of time - those products are
         | designed to be held for short periods only.
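
A rough sketch of the daily-rebalancing effect described above, under
stated assumptions: no fees, no borrowing cost, and a made-up return
sequence rather than market data. The helper `compound` is a
hypothetical name.

```python
def compound(daily_returns, leverage=1.0):
    """Grow 1.0 through a list of daily returns, resetting leverage each day."""
    value = 1.0
    for r in daily_returns:
        value *= 1.0 + leverage * r
    return value


# Hypothetical choppy market: alternate +10% and -10% for ten days.
returns = [0.10, -0.10] * 5

index_value = compound(returns)          # 0.99 per up/down pair
levered_value = compound(returns, 2.0)   # 1.2 * 0.8 = 0.96 per up/down pair
```

In this choppy sequence the 2x fund loses more than twice what the index
loses. With a steadily rising sequence such as `[0.10, 0.10]` the same
function shows the 2x fund gaining more than twice the index gain, so the
sketch illustrates path dependence and does not settle the disagreement
in the thread.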
        
       | mrfusion wrote:
       | Can you share a pic of your monitor in use? Sounds like a cool
       | idea.
        
       | joeman1000 wrote:
       | Am I reading this correctly?: you were hired at MS with only
       | cursory python knowledge? I'm very jealous and thinking more and
       | more about going into IT after I leave uni for engineering (not
       | software engineering). I know python well, along with a few other
       | things I've picked up over the years (emacs/elisp, vim/vimscript,
       | LaTeX formatting, JS, Common Lisp, APL, bash scripting,
       | mathematica and matlab etc.). Would this be enough to land a
       | position like yours? I am lacking in the AI area, but I can begin
       | that on my weekends.
        
         | joeman1000 wrote:
         | What's wrong with my comment? I'm just jealous of you IT guys..
        
       | lordgrenville wrote:
       | This feels like someone's private rough notes, but seeing as it's
       | on the front page I have one nit-pick:
       | 
       | > OS packages...which are installed with apt-get (linux) or
       | homebrew (mac)
       | 
       | (Home)brew isn't bundled with Mac, and also works on Linux. And
       | lots of Linux distros don't use apt.
        
       | Zababa wrote:
       | About Java vs Python, modern Java would be:
       |
       |     import java.util.ArrayList;
       |     var cars = new ArrayList<String>();
       |
       | Python would be:
       |
       |     cars: list[str] = []
       | 
       | The big difference seems to be that ArrayList is not a "default"
       | data structure in Java, but it is in Python. While I like the
       | Python example better, I'm not offended by the modern Java.
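
A small sketch of what the Python annotation does and does not do,
assuming Python 3.9+ (for the built-in `list[str]` syntax) and using the
built-in `str` type. The mention of mypy is an assumption about tooling;
any static checker would be run as a separate step.

```python
# The annotation documents intent and feeds static checkers.
cars: list[str] = []
cars.append("Volvo")

# The interpreter does not enforce the annotation, so this line runs.
# A static checker such as mypy would be expected to flag it.
cars.append(42)
```

This is the sense in which the hints are optional: the program behaves
the same at runtime with or without them.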
        
         | techzerd wrote:
         | The author goes full circle by introducing types in python.
         | Almost every java developer uses an IDE with autocomplete so
         | they would most likely not use var. Where they wouldn't, they
         | type the interface they want to use, add the assignment
         | operator and then just let the IDE suggest the implementation,
         | add the import and even format the line/file. Such trivialities
         | don't help when attempting to distinguish the flexibility
         | Python brings for arbitrary/quick code.
        
         | [deleted]
        
         | lokimedes wrote:
         | That modern Python example screams regression to me. Why not
         | simply cars = []?
         | 
         | This obsession with killing perfectly good languages with
         | strongly typed hints is completely undermining the point.
        
           | tbrownaw wrote:
           | Static type information makes it far easier to avoid a whole
           | class of bugs, which becomes more important as your project
           | gets larger.
        
           | zentropia wrote:
           | I have worked with a large Python project without types. It's a
           | nightmare. Types are extremely useful.
        
           | Zababa wrote:
           | I think you're in the minority thinking that Python is
           | "killed" by adding static (not strong, Python is already
           | strongly typed) type hints. They are also optional, so you're
           | free to not use them.
           | 
           | To go back to the code example, if you want to express the
           | same thing in Python as in Java, you have to add a type hint
           | to be able to statically check the code. A good thing about
           | Python is that you can choose when and if you want to use a
           | static type hint, while in Java you're forced to use them.
        
         | qsort wrote:
         | Actually, as a Java dev, I quite like modern Java for data
         | work. Streams + static typing make aggregating data a breeze.
         | 
         | I assume for more advanced work like the one OP mentions you'd
         | still want to stick to Python because of the superior
         | ecosystem, but I was pleasantly surprised by Java.
         | 
         | Can't believe I'm writing this, but I actually like modern
         | Java.
        
           | Zababa wrote:
           | I've only followed Java from far away (last time I used it
           | was Java 7 in college) but modern Java seems like a very nice
           | language. There are also Graal, alternative ecosystems to
           | Spring, good things like that.
        
       | kgwgk wrote:
       | Nice pictures. (Didn't read the words... but I liked the
       | pictures.)
        
         | rossdavidh wrote:
         | Good to know I'm not the only one.
        
         | irrational wrote:
         | I'd like a copy of the "Excuses to Miss Meetings" book.
        
         | jsrcout wrote:
         | I just love those fake O'Reilly book covers. Wish I could order
         | the actual books :-)
        
         | 123pie123 wrote:
         | I was distracted (in a good way) by the pictures
         | 
         | but I did read most of it. I thought it was a fairly honest and
         | good blog
         | 
         | I did like the diagram of the different company styles
         | 
         | I'll read it 100% when I have enough time
        
           | jldugger wrote:
           | > I did like the diagram of the different company styles
           | 
           | That's quite an old internet cartoon.
        
       | [deleted]
        
       | thom wrote:
       | I felt like this article was a bit light on data scientist
       | specific advice, and while I am not one, I do herd them for a
       | living, so thought I'd put some random thoughts together:
       | 
       | 1) Quite often you are not training a machine to be the best at
       | something. You're training a machine to help a human to be the
       | best at something. Be sure to optimise for this when necessary.
       | 
       | 2) Push predictions, don't ask others to pull them. Focus on
       | decoupling your data science team and their customers early on.
       | The worst thing that can happen is duplicating logic, first to
       | craft features during training, and later to submit those
       | features to an API from clients. Even if you hide the feature
       | engineering behind the API, this can either slow down
       | predictions, or still require bulky requests from the client in
       | the case of sequence data. Instead, stream data into your feature
       | store, and stream predictions out onto your event bus. Then your
       | data science team can truly be a black box.
       | 
       | 3) Unit test invariants in your model's outputs. While you can't
       | write tests for exact outputs, you can say "such and such a case
       | should output a higher value than some other case, all things
       | being equal". When your model disagrees, do at least consider
       | that the model may be correct though.
       | 
       | 4) Do ablation tests in reverse, and unit test each addition to
       | your model's architecture to prove it helps.
       | 
       | 5) Often you will train a model on historical data, and content
       | yourself that all future predictions will be outside this
       | training set. However, don't forget that sometimes updates to
       | historical data will trigger a prediction to be recalculated, and
       | this might be overfit. Sometimes you can serve cached results,
       | but small feature changes make this harder.
       | 
       | 6) Your data scientists are probably the people who are most
       | intimate with your data. They will be the first to stumble on
       | bugs and biases, so give them very good channels to report QA
       | issues. If you are a lone data scientist in a larger
       | organisation, seek out and forge these channels early.
       | 
       | 7) Don't treat labelling tools as grubby little hacked together
       | apps. Resource them properly, make sure you watch and listen to
       | the humans building and using them.
       | 
       | 8) Have objective ways of comparing models that are thematically
       | similar but may differ in their exact goal variables. If you
       | can't directly compare log loss or whatever like-for-like, find
       | some more external criteria.
       | 
       | 9) Much of your job is building trust in your models with
       | stakeholders. Don't be afraid to build simple stuff that captures
       | established intuitions before going deep - show people the
       | machine gets the basics first.
       | 
       | 10) If you're struggling to replicate a result from a paper,
       | either with or without the original code, welcome to academia.
       | 
       | Probably not earth shattering stuff, I grant you.
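
Point 3 above (unit testing invariants in a model's outputs) could be
sketched as follows. `predict_price` is a hypothetical stand-in for a
trained model, and the feature names are invented for illustration. The
test checks an ordering between two cases, not an exact output.

```python
def predict_price(area_m2, bedrooms):
    """Hypothetical stand-in for a trained model's predict function."""
    return 1500.0 * area_m2 + 10000.0 * bedrooms


def test_larger_area_predicts_higher_price():
    # All other things being equal, more floor area should give a
    # higher prediction than less floor area.
    small = predict_price(area_m2=50, bedrooms=2)
    large = predict_price(area_m2=80, bedrooms=2)
    assert large > small
```

A test runner such as pytest would collect the `test_` function
automatically. If a real model fails such a test, the comment's caveat
applies: the intuition behind the invariant may be what is wrong.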
        
       | cobertos wrote:
       | I really liked this post. Tons of small tidbits I feel I'd only
       | get from working in the teams they worked on. Some things I
       | noted:
       | 
       | * Check out `ray` as an alternative to `multiprocessing`
       | 
       | * Check out `tqdm`
       | 
       | * Use `pdb` more
       | 
       | * See if fast.ai or
       | https://jalammar.github.io/illustrated-transformer/ are
       | worthwhile
       | 
       | * Prioritize the papers I read better
       | 
       | * _Leveraged_ index funds?
        
         | chriak8292 wrote:
         | Risk-adjusted return (Sharpe)
         | 
         | - VTI/VOO: 0.6 to 0.8
         | 
         | - Leveraged S&P500 ETFs: 0.5 to 0.9
         | 
         | - Savings account: positive infinity
         | 
         | - Treasury bonds: 1.0 to 1.2
         | 
         | Leveraged index funds give approximately the same risk-adjusted
         | return as passive index funds but with 2-2.5x the absolute
         | return (eg, UPRO).
         | 
         | However, if you have a weak stomach (you don't like seeing your
         | balance drop), then leveraged ETFs are not for you.
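
A minimal sketch of why ideal leverage leaves the Sharpe ratio roughly
unchanged while scaling the absolute return. It assumes a zero risk-free
rate, no fees, and no borrowing cost. The return series is invented, not
market data, and `sharpe_ratio` is a hypothetical helper.

```python
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the sample standard deviation."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess)


# Hypothetical per-period returns, not real market data.
index_returns = [0.02, -0.01, 0.03, -0.02, 0.01]

# Idealized 2x exposure: every period's return is simply doubled.
levered_returns = [2 * r for r in index_returns]

index_sharpe = sharpe_ratio(index_returns)
levered_sharpe = sharpe_ratio(levered_returns)
```

Doubling every return doubles both the mean and the standard deviation,
so the ratio is the same. Real funds pay expenses and financing costs,
which would pull the leveraged figure down.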
        
         | cauthon wrote:
         | Check out ipdb instead of pdb. It's pdb but with the ipython
         | repl instead of python's
        
       | ZephyrBlu wrote:
       | I was hoping for more DS related stuff. It almost sounds like
       | you're learning to be a SWE!
       | 
       | The investing section is curious.
       | 
       | > _On brilliant advice from the man who arguably went from mere
       | millions to decabillions faster than anyone in modern history,
       | I've put a large chunk of my money in leveraged index funds and
       | etfs_
       | 
       | Who are you referencing here and did you do any DD other than
       | taking his advice?
       | 
       | I'm wondering what the downsides of leveraged index funds and
       | ETFs are since I'm not sure how they work.
        
         | dawndrain wrote:
         | > did you do any DD other than taking his advice?
         | 
         | I did some backtesting simulations that made leveraged
         | investing look pretty awesome. The effective borrow rate for
         | funds like spxl is crazy low, way better than if I were to
         | borrow myself. (Also, fwiw I was pretty conservative and am
         | overall only around 2x-leveraged.)
         | 
         | The internet is very opposed to leveraged investing imo, but I
         | think most of the concerns are pretty dumb. There was this one
         | blog post where this guy ran ten simulations of his own, most
         | of which showed the leveraged portfolio doing comparably to the
         | baseline, but a couple showed it doing worse and one saw
         | the leveraged portfolio 100x'ing or something... and he
         | concluded that it wasn't worth it??
         | 
         | People will also appeal to volatility drag as a superficially
         | sophisticated knockdown (in short, imagine all four two-step
         | paths in which the market goes up or down by 10% at each step.
         | Then the baseline market averages out to (.81 + .99 + .99 +
         | 1.21)/4 = 1, and a 3x leveraged portfolio averages out to (.49
         | + .91 + .91 + 1.69)/4 = 1. Volatility drag is those two middle
         | worlds where the leveraged portfolio does badly despite the
         | market as a whole basically ending up where it started.)
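
The two-step arithmetic above can be reproduced with a short sketch. It
enumerates all four up/down paths for a 10% step, once unleveraged and
once at 3x, with leverage reset at each step and no costs. The helper
`path_outcomes` is a hypothetical name.

```python
from itertools import product


def path_outcomes(step, leverage=1):
    """Final value of 1.0 for every two-step up/down path, sorted ascending."""
    moves = (1 + leverage * step, 1 - leverage * step)
    return sorted(a * b for a, b in product(moves, repeat=2))


baseline = path_outcomes(0.10)              # about 0.81, 0.99, 0.99, 1.21
levered = path_outcomes(0.10, leverage=3)   # about 0.49, 0.91, 0.91, 1.69

baseline_average = sum(baseline) / len(baseline)
levered_average = sum(levered) / len(levered)
```

Both averages come out at 1 up to floating-point rounding, matching the
figures in the comment. The leveraged outcomes are more spread out, and
the two middle paths end at about 0.91 against 0.99 for the baseline.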
        
           | ZephyrBlu wrote:
           | Thanks for your perspective. I'm considering using leverage so
           | it's interesting to hear from people who are currently using
           | it.
           | 
           | Also, you might find this tweet and paper interesting:
           | 
           | - Tweet:
           | https://twitter.com/patio11/status/1432891941138563077
           | 
           | - Paper:
           | https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89...
        
             | akg_67 wrote:
             | Use the keyword "HedgeFundie" (the username that first started
             | the discussion on Bogleheads; most now refer to this strategy
             | by his name) to search discussions about leveraged ETFs on the
             | Bogleheads forum and Reddit. There are 300+ pages of
             | discussion on this topic on Bogleheads alone.
             | 
             | Also check out information about the permanent portfolio put
             | forward by Bridgewater Associates.
        
         | dawndrain wrote:
         | > I was hoping for more DS related stuff. It almost sounds like
         | you're learning to be a SWE!
         | 
         | It kind of felt that way too :). Some more data sciency things
         | were learning how transformers work, hacking pytorch, using
         | visualization tools like tensorboard and wandb, web scraping,
         | better using parallelism, tuning hyperparameters (mostly the
         | learning rate tbh), better fluency with the command line than I
         | assume most swe's need, getting very comfortable inspecting
         | data, making experiments more reproducible, reading lots of
         | papers, writing papers, and trying (somewhat half-heartedly) to
         | get published.
        
         | krrrh wrote:
         | The leverage means they go up faster, but they also go down
         | faster.
         | 
         | In regular investing you feel good when you get a modest
         | return, and bad when you experience a modest loss. With
         | leveraged funds you feel like a genius when the market goes up
         | and like a complete moron when you get wiped out.
        
           | ZephyrBlu wrote:
           | I understand how leverage works, just not in the context of
           | an index fund or ETF.
           | 
           | From your description it sounds like the leverage is baked
           | into it, so it's kind of like "safe" leverage in that you
           | can't lose more than you have.
        
         | kbenson wrote:
         | Almost definitely Warren Buffett, who went so far as to make a
         | very public bet that index funds performed better than hedge
         | funds (and thus most active investing). You can see details on
         | the deal and outcome here
         | https://www.investopedia.com/articles/investing/030916/buffe...
         | 
         | Also, he's got tens of billions.
         | 
         | Edit: Or maybe not on reflection about the "faster than anyone"
         | bit. I dunno.
        
           | comp_throw7 wrote:
           | Sam Bankman-Fried, unless I miss my guess.
        
       | jupin wrote:
       | > Google doesn't allow any production-level projects to be
       | written in python due to safety concerns.
       | 
       | Is this true?
        
       | qsort wrote:
       | > Google doesn't allow any production-level projects to be
       | written in python due to safety concerns
       | 
       | Is this actually true? If it's true that Google doesn't allow
       | Python in production, it seems unlikely that it's due to security
       | concerns.
        
         | dawndrain wrote:
         | I heard this second-hand, not totally sure it's true
        
           | wheelinsupial wrote:
           | > in fact Google doesn't allow any production-level projects
           | to be written in python due to safety concerns.
           | 
           | Then why did you write it as a fact?
        
           | smueller1234 wrote:
           | It's not.
           | 
           | The remote connection to safety (did you mean security?)
           | would be that static source analysis tools don't work as
           | reliably with dynamic languages. That matters at Google. But
           | you don't even have to think as hard about it: Python is
           | simply comparatively slow and inefficient. Google's fleet is
           | large. It pays off to use more efficient languages.
           | 
           | (There's also the whole thing about Python being _largely_
           | single threaded and computers being very wide these days, as
           | well as being a terrible memory hog and memory making up half
           | the cost of servers.)
        
         | MontyCarloHall wrote:
         | Their command line interfaces to Google Cloud Platform (e.g.
         | gcloud/gsutil) are 100% Python. Is that not considered a
         | "production level project"?
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-09-04 23:00 UTC)