[HN Gopher] Lessons Learned from Two Years as a Data Scientist
___________________________________________________________________
Lessons Learned from Two Years as a Data Scientist
Author : jessedrain
Score : 88 points
Date : 2021-09-02 04:34 UTC (2 days ago)
(HTM) web link (dawndrain.github.io)
(TXT) w3m dump (dawndrain.github.io)
| jupin wrote:
| I love the tip about using the python debugger "pdb". This
| reminds me of the similar Ansible debug feature (e.g. "debugger:
| on_failed") which lets you jump into an in-flight playbook.
| Zababa wrote:
| I remember using OCaml a bit and it has a time-travelling
| debugger. You launch the program with the debugger, it crashes
| and then you can inspect the program as you wish. I was really
| impressed by this.
| throwaway98797 wrote:
| Good stuff, except:
|
| <<I've put a large chunk of my money in leveraged index funds and
| etfs>>
|
| This is dangerous advice if you go all in on this.
| bt3 wrote:
| SPXL also has an expense ratio over 1%, which will eat away at
| earnings unless in the best of bull rushes (now).
| chriak8292 wrote:
| Leveraged ETFs outperform VTI/VOO (in terms of total return)
| over 30-40 year investment horizons. Period.
|
| Now, the risk (potential one-year downside) is not for
| everyone.
| throwaway98797 wrote:
| A little bit of leverage as others have commented is fine.
|
| The problem happens if there's a 51% drop in a 2x levered fund.
|
| There's a reason the fund the article's OP is in started in
| 2008 and not 40 years ago.
| pja wrote:
| I believe the available evidence suggests that over the long
| term, a small amount of leverage increases returns. Obviously
| it increases volatility as well, but if your time horizon is
| long, you can cope with that.
|
| You do not want to invest in ETFs that are themselves leveraged
| & rebalance daily however, that's going to eat all your money
| if you hold them for any length of time - those products are
| designed to be held for short periods only.
| mrfusion wrote:
| Can you share a pic of your monitor in use? Sounds like a cool
| idea.
| joeman1000 wrote:
| Am I reading this correctly?: you were hired at MS with only
| cursory python knowledge? I'm very jealous and thinking more and
| more about going into IT after I leave uni for engineering (not
| software engineering). I know python well, along with a few other
| things I've picked up over the years (emacs/elisp, vim/vimscript,
| LaTeX formatting, JS, Common Lisp, APL, bash scripting,
| mathematica and matlab etc.). Would this be enough to land a
| position like yours? I am lacking in the AI area, but I can begin
| that on my weekends.
| joeman1000 wrote:
| What's wrong with my comment? I'm just jealous of you IT guys..
| lordgrenville wrote:
| This feels like someone's private rough notes, but seeing as it's
| on the front page I have one nit-pick:
|
| > OS packages...which are installed with apt-get (linux) or
| homebrew (mac)
|
| (Home)brew isn't bundled with Mac, and also works on Linux. And
| lots of Linux distros don't use apt.
| Zababa wrote:
| About Java vs Python, modern Java would be:
| import java.util.ArrayList;
| var cars = new ArrayList<String>();
|
| Python would be: cars: list[str] = []
|
| The big difference seems to be that ArrayList is not a "default"
| data structure in Java, but it is in Python. While I like the
| Python example better, I'm not offended by the modern Java.
| techzerd wrote:
| The author goes full circle by introducing types in python.
| Almost every java developer uses an IDE with autocomplete so
| they would most likely not use a var (where they wouldn't)
| ..type the interface they want to use, add the assignment
| operator and then just let the IDE suggest the implementation,
| add the import and even format the line/file. Such trivialities
| don't help when attempting to distinguish the flexibility
| python brings for arbitrary/quick code
| [deleted]
| lokimedes wrote:
| That modern Python example screams regression to me. Why not
| simply cars = []?
|
| This obsession with killing perfectly good languages with
| strongly typed hints is completely undermining the point.
| tbrownaw wrote:
| Static type information makes it far easier to avoid a whole
| class of bugs, which becomes more important as your project
| gets larger.
| zentropia wrote:
| I have worked with a large python project without types. It's a
| nightmare. Types are extremely useful.
| Zababa wrote:
| I think you're in the minority thinking that Python is
| "killed" by adding static (not strong, Python is already
| strongly typed) type hints. They are also optional, so you're
| free to not use them.
|
| To go back to the code example, if you want to express the
| same thing in Python as in Java, you have to add a type hint
| to be able to statically check the code. A good thing about
| Python is that you can choose when and if you want to use a
| static type hint, while in Java you're forced to use them.
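
The distinction drawn above between static hints and strong typing can be sketched in a few lines. This is a minimal illustration, assuming Python 3.9+ (for the built-in generic `list[str]`); the variable names `untyped_cars` and `mixed_ok` are made up for the example.

```python
# Requires Python 3.9+ for the built-in generic `list[str]`.

# Both forms create the same empty list; the annotation is only metadata.
cars: list[str] = []
untyped_cars = []

# Type hints are not enforced at runtime: this append succeeds, and only a
# static checker (for example mypy) would flag it.
cars.append(42)

# Strong typing is a separate property: Python refuses to mix str and int
# implicitly, with or without hints.
try:
    "1" + 1
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The hint changes nothing at runtime, which is why it stays optional; the `TypeError` is raised whether or not any hints are present.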
| qsort wrote:
| Actually, as a Java dev, I quite like modern Java for data
| work. Streams + static typing make aggregating data a breeze.
|
| I assume for more advanced work like the one OP mentions you'd
| still want to stick to Python because of the superior
| ecosystem, but I was pleasantly surprised by Java.
|
| Can't believe I'm writing this, but I actually like modern
| Java.
| Zababa wrote:
| I've only followed Java from far away (last time I used it
| was Java 7 in college) but modern Java seems like a very nice
| language. There are also Graal, alternative ecosystems to
| Spring, good things like that.
| kgwgk wrote:
| Nice pictures. (Didn't read the words... but I liked the
| pictures.)
| rossdavidh wrote:
| Good to know I'm not the only one.
| irrational wrote:
| I'd like a copy of the "Excuses to Miss Meetings" book.
| jsrcout wrote:
| I just love those fake O'Reilly book covers. Wish I could order
| the actual books :-)
| 123pie123 wrote:
| I was distracted (in a good way) by the pictures
|
| but I did read most of it. I thought it was a fairly honest and
| good blog
|
| I did like the diagram of the different company styles
|
| I will read it 100% when I have enough time
| jldugger wrote:
| > I did like the diagram of the different company styles
|
| That's quite an old internet cartoon.
| [deleted]
| thom wrote:
| I felt like this article was a bit light on data scientist
| specific advice, and while I am not one, I do herd them for a
| living, so thought I'd put some random thoughts together:
|
| 1) Quite often you are not training a machine to be the best at
| something. You're training a machine to help a human to be the
| best at something. Be sure to optimise for this when necessary.
|
| 2) Push predictions, don't ask others to pull them. Focus on
| decoupling your data science team and their customers early on.
| The worst thing that can happen is duplicating logic, first to
| craft features during training, and later to submit those
| features to an API from clients. Even if you hide the feature
| engineering behind the API, this can either slow down
| predictions, or still require bulky requests from the client in
| the case of sequence data. Instead, stream data into your feature
| store, and stream predictions out onto your event bus. Then your
| data science team can truly be a black box.
|
| 3) Unit test invariants in your model's outputs. While you can't
| write tests for exact outputs, you can say "such and such a case
| should output a higher value than some other case, all things
| being equal". When your model disagrees, do at least consider
| that the model may be correct though.
|
| 4) Do ablation tests in reverse, and unit test each addition to
| your model's architecture to prove it helps.
|
| 5) Often you will train a model on historical data, and content
| yourself that all future predictions will be outside this
| training set. However, don't forget that sometimes updates to
| historical data will trigger a prediction to be recalculated, and
| this might be overfit. Sometimes you can serve cached results,
| but small feature changes make this harder.
|
| 6) Your data scientists are probably the people who are most
| intimate with your data. They will be the first to stumble on
| bugs and biases, so give them very good channels to report QA
| issues. If you are a lone data scientist in a larger
| organisation, seek out and forge these channels early.
|
| 7) Don't treat labelling tools as grubby little hacked together
| apps. Resource them properly, make sure you watch and listen to
| the humans building and using them.
|
| 8) Have objective ways of comparing models that are thematically
| similar but may differ in their exact goal variables. If you
| can't directly compare log loss or whatever like-for-like, find
| some more external criteria.
|
| 9) Much of your job is building trust in your models with
| stakeholders. Don't be afraid to build simple stuff that captures
| established intuitions before going deep - show people the
| machine gets the basics first.
|
| 10) If you're struggling to replicate a result from a paper,
| either with or without the original code, welcome to academia.
|
| Probably not earth shattering stuff, I grant you.
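
Point 3 in the list above (testing invariants rather than exact outputs) might look roughly like the following. It is only a sketch: `predict_win_probability` is a hypothetical stand-in for a trained model, and the "rating gap" feature is invented for illustration.

```python
import math


def predict_win_probability(rating_gap: float) -> float:
    """Hypothetical stand-in for a trained model: logistic in the rating gap."""
    return 1.0 / (1.0 + math.exp(-rating_gap / 100.0))


def check_monotonic(predict, low_case: float, high_case: float) -> bool:
    """Invariant: the 'stronger' case must not score below the 'weaker' one."""
    return predict(high_case) >= predict(low_case)


def test_larger_gap_scores_higher():
    # We cannot assert an exact output, only the ordering between two cases
    # that differ in a single feature.
    assert check_monotonic(predict_win_probability, low_case=50.0, high_case=150.0)


def test_output_is_a_probability():
    # A second invariant: outputs must stay within [0, 1] across the range.
    for gap in (-400.0, 0.0, 400.0):
        assert 0.0 <= predict_win_probability(gap) <= 1.0
```

Tests written this way survive retraining, since they pin down orderings and ranges instead of specific numbers. As the comment notes, a failure is a prompt to investigate, not proof that the model is wrong.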
| cobertos wrote:
| I really liked this post. Tons of small tidbits I feel I'd only
| get from working in the teams they worked on. Some things I
| noted:
|
| * Check out `ray` as an alternative to `multiprocessing`
|
| * Check out `tqdm`
|
| * Use `pdb` more
|
| * See if fast.ai or https://jalammar.github.io/illustrated-transformer/
| are worthwhile
|
| * Prioritize the papers I read better
|
| * _Leveraged_ index funds?
| chriak8292 wrote:
| Risk-adjusted return (Sharpe)
|
| - VTI/VOO: 0.6 to 0.8
|
| - Leveraged S&P500 ETFs: 0.5 to 0.9
|
| - Savings account: positive infinity
|
| - Treasury bonds: 1.0 to 1.2
|
| Leveraged index funds give approximately the same risk-adjusted
| return as passive index funds but with 2-2.5x the absolute
| return (eg, UPRO).
|
| However, if you have a weak stomach (you don't like seeing your
| balance drop), then leveraged ETFs are not for you.
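
One way to see why leverage might leave the Sharpe ratio roughly unchanged is the idealised sketch below. It assumes a zero risk-free rate and no fees or borrowing costs, and the return series is made up for illustration (it is not market data). Real leveraged funds carry expense ratios and financing costs, so their figures will differ.

```python
import statistics


def sharpe_ratio(returns, risk_free_rate=0.0):
    """Mean excess return divided by the standard deviation of excess returns.

    Computed per period, with no annualisation.
    """
    excess = [r - risk_free_rate for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


# Made-up period returns, for illustration only (not market data).
base_returns = [0.02, -0.01, 0.03, -0.02, 0.01]

# Idealised 2x leverage: every period return doubled, no fees or borrow cost.
leveraged_returns = [2 * r for r in base_returns]

base_sharpe = sharpe_ratio(base_returns)
leveraged_sharpe = sharpe_ratio(leveraged_returns)
```

Doubling every return doubles both the mean and the standard deviation, so under these assumptions the two ratios coincide while the absolute return scales with the leverage.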
| cauthon wrote:
| Check out ipdb instead of pdb. It's pdb but with the ipython
| repl instead of python's
| ZephyrBlu wrote:
| I was hoping for more DS related stuff. It almost sounds like
| you're learning to be a SWE!
|
| The investing section is curious.
|
| > _On brilliant advice from the man who arguably went from mere
| millions to decabillions faster than anyone in modern history,
| I've put a large chunk of my money in leveraged index funds and
| etfs_
|
| Who are you referencing here and did you do any DD other than
| taking his advice?
|
| I'm wondering what the downsides of leveraged index funds and
| ETFs are since I'm not sure how they work.
| dawndrain wrote:
| > did you do any DD other than taking his advice?
|
| I did some backtesting simulations that made leveraged
| investing look pretty awesome. The effective borrow rate for
| funds like spxl is crazy low, way better than if I were to
| borrow myself. (Also, fwiw I was pretty conservative and am
| overall only around 2x-leveraged.)
|
| The internet is very opposed to leveraged investing imo, but I
| think most of the concerns are pretty dumb. There was this one
| blog post where this guy ran ten simulations of his own, most
| of which showed the leveraged portfolio doing comparably to the
| baseline, but a couple showed it doing worse and one saw
| the leveraged portfolio 100x'ing or something... and he
| concluded that it wasn't worth it??
|
| People will also appeal to volatility drag as a superficially
| sophisticated knockdown (in short, imagine all four two-step
| paths in which the market goes up or down by 10% at each step.
| Then the baseline market averages out to (.81 + .99 + .99 +
| 1.21)/4 = 1, and a 3x leveraged portfolio averages out to (.49
| + .91 + .91 + 1.69)/4 = 1. Volatility drag is those two middle
| worlds where the leveraged portfolio does badly despite the
| market as a whole basically ending up where it started.)
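
The two-step arithmetic in the comment above can be reproduced by enumerating the four paths. This is a sketch of that toy model only: the function name `final_values` is made up, and it assumes the portfolio is rebalanced to its target leverage at every step, with no fees or borrowing costs.

```python
from itertools import product


def final_values(step: float, leverage: float, n_steps: int = 2):
    """Final portfolio value (starting at 1.0) for every up/down path.

    Each step the market moves by +step or -step, and the portfolio moves by
    leverage times that amount, rebalanced at every step.
    """
    values = []
    for path in product((-step, +step), repeat=n_steps):
        value = 1.0
        for move in path:
            value *= 1.0 + leverage * move
        values.append(value)
    return sorted(values)


baseline = final_values(step=0.10, leverage=1.0)
leveraged = final_values(step=0.10, leverage=3.0)

baseline_mean = sum(baseline) / len(baseline)
leveraged_mean = sum(leveraged) / len(leveraged)
```

Up to floating-point rounding, `baseline` comes out as .81, .99, .99, 1.21 and `leveraged` as .49, .91, .91, 1.69, matching the figures in the comment. Both means are 1, while the two middle paths (.99 versus .91) are where the drag shows up.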
| ZephyrBlu wrote:
| Thanks for your perspective. I'm considering using leverage so
| it's interesting to hear from people who are currently using
| it.
|
| Also, you might find this tweet and paper interesting:
|
| - Tweet:
| https://twitter.com/patio11/status/1432891941138563077
|
| - Paper: https://citeseerx.ist.psu.edu/viewdoc/download?doi=1
| 0.1.1.89...
| akg_67 wrote:
| Use the keyword "HedgeFundie" (the username that first started
| the discussion on Bogleheads; most now refer to this strategy
| by his name) to search discussions about leveraged ETFs
| on the Bogleheads forum and Reddit. There are over 300 pages
| worth of discussion on this topic on Bogleheads alone.
|
| Also check out information about the permanent portfolio put
| forward by Bridgewater Associates.
| dawndrain wrote:
| > I was hoping for more DS related stuff. It almost sounds like
| you're learning to be a SWE!
|
| It kind of felt that way too :). Some more data sciency things
| were learning how transformers work, hacking pytorch, using
| visualization tools like tensorboard and wandb, web scraping,
| better using parallelism, tuning hyperparameters (mostly the
| learning rate tbh), better fluency with the command line than I
| assume most swe's need, getting very comfortable inspecting
| data, making experiments more reproducible, reading lots of
| papers, writing papers, and trying (somewhat half-heartedly) to
| get published.
| krrrh wrote:
| The leverage means they go up faster, but they also go down
| faster.
|
| In regular investing you feel good when you get a modest
| return, and bad when you experience a modest loss. With
| leveraged funds you feel like a genius when the market goes up
| and like a complete moron when you get wiped out.
| ZephyrBlu wrote:
| I understand how leverage works, just not in the context of
| an index fund or ETF.
|
| From your description it sounds like the leverage is baked
| into it, so it's kind of like "safe" leverage in that you
| can't lose more than you have.
| kbenson wrote:
| Almost definitely Warren Buffett, who went so far as to make a
| very public bet that index funds performed better than hedge
| funds (and thus most active investing). You can see details on
| the deal and outcome here
| https://www.investopedia.com/articles/investing/030916/buffe...
|
| Also, he's got tens of billions.
|
| Edit: Or maybe not on reflection about the "faster than anyone"
| bit. I dunno.
| comp_throw7 wrote:
| Sam Bankman-Fried, unless I miss my guess.
| jupin wrote:
| > Google doesn't allow any production-level projects to be
| written in python due to safety concerns.
|
| Is this true?
| qsort wrote:
| > Google doesn't allow any production-level projects to be
| written in python due to safety concerns
|
| Is this actually true? If it's true that Google doesn't allow
| Python in production, it seems unlikely that it's due to security
| concerns.
| dawndrain wrote:
| I heard this second-hand, not totally sure it's true
| wheelinsupial wrote:
| > in fact Google doesn't allow any production-level projects
| to be written in python due to safety concerns.
|
| Then why did you write it as a fact?
| smueller1234 wrote:
| It's not.
|
| The remote connection to safety (did you mean security?)
| would be that static source analysis tools don't work as
| reliably with dynamic languages. That matters at Google. But
| you don't even have to think as hard about it: Python is
| simply comparatively slow and inefficient. Google's fleet is
| large. It pays off to use more efficient languages.
|
| (There's also the whole thing about Python being _largely_
| single threaded and computers being very wide these days, as
| well as being a terrible memory hog and memory making up half
| the cost of servers.)
| MontyCarloHall wrote:
| Their command line interfaces to Google Cloud Platform (e.g.
| gcloud/gsutil) are 100% Python. Is that not considered a
| "production level project"?
| [deleted]
___________________________________________________________________
(page generated 2021-09-04 23:00 UTC)