https://github.com/dynamicwebpaige/kaggle-survey-spelunking/blob/main/README.md

dynamicwebpaige / kaggle-survey-spelunking

@dynamicwebpaige
dynamicwebpaige Updated README.
Latest commit 597acd7 Jun 19, 2021 History

 Machine Learning Cohorts: A Synthesis

"Data Scientist", "Machine Learning Developer", "Deep Learning
Engineer", "Data Engineer", "ML Ops Engineer", and "Data Analyst" are
often overloaded role titles -- and not necessarily indicative of a
user's day-to-day work, or the tools they are using to accomplish
that work.

To better understand and characterize these diverse user segments, we
can use tools, libraries, and frameworks referenced in the Kaggle:
State of Machine Learning and Data Science 2020 Survey to cluster
engineers into cohort groups. We can also loosely tie these cohorts
to their anticipated cloud spend; identify typical tasks each user
cohort is responsible for completing; assess compute and storage
requirements for each user cohort; and estimate cohort size, based on
survey responses.

 TL;DR

    Survey respondents are overwhelmingly performing exploratory
    analysis using small- to medium-sized data sets stored as flat
    files, on local machines. Machine learning projects - if ML is
    being attempted at all - are in early stages, using traditional
    methods that are best-suited for high-RAM CPU rather than GPU
    SKUs (ex: scikit-learn and clustering approaches).

    Based on responses, data science teams trend small (0-5
    engineers), with light rigor on SDLC best practices (ex: version
    control); and most data scientists come from non-CS backgrounds,
    with minimal programming experience. Preferred tools are
    overwhelmingly open-source and non-proprietary. If Visual Studio
    Code is being used by survey respondents, it is most often being
    used for non-interactive, production machine learning and data
    science work.

 Data Sources

The following surveys were included in the analysis:

                                              Applicable Raw Data
No. Survey         Description                # (out of  Available? Utility
                                              total)

1.  Kaggle State   Annual survey of 6-7M      20,036     Y
    of Data        registered Kaggle users.
    Science and    Kaggle is the world's
    ML 2020        largest online community
                   for machine learning and
                   data science.

2.  StackOverflow  Annual survey of           5,200      Y
    Developer      StackOverflow users. Not
    Survey         domain-specific; ~8% of
                   respondents indicated
                   doing data-affiliated work
                   (data science, ML,
                   research).

3.  Python         Annual survey of Python    15,400     N
    Developers     developers, completed in
    Survey 2020    partnership between the
                   PSF and JetBrains. Not
                   data science and
                   ML-specific, though ~50%
                   of respondents indicate
                   they use Python for EDA.

4.  Anaconda Data  Focused survey for data    2,360      N
    Science        analysts and ML engineers
    Survey         administered by Anaconda.
                   Data not released
                   publicly; but an
                   executive summary is
                   available.

5.  SlashData      Not a domain-specific      5,009      N
    Survey         survey, and not segmented
                   out by tools used. More
                   than half of ML and data
                   science respondents (5K)
                   are hobbyists and
                   students, and just
                   learning how to do ML;
                   not professionals.

Given their focused nature, the Anaconda survey and the Kaggle survey
were selected as the most useful instruments for this analysis. The
data science and machine learning respondents from the Python
Developers Survey (55% of total); the professional data science and
machine learning respondents from the SlashData Survey (25% of
total); and the data science respondents from the StackOverflow
Developer Survey (8% of total) are used as supplemental evidence.

Though just under a third of total respondents for the Kaggle Survey
and the PSF Survey indicated that they were using VS Code, this was
found -- through qualitative interviews, as well as from social media
scraping and GitHub issues analysis -- not to be for exploratory data
analysis or interactive model building, but rather for machine
learning model deployment; for other types of software development or
Python library building; or for lightweight editing of Python and
markdown files.

[Screen]

The data is available to view via GitHub's Flat Data, and to download
from the Kaggle website.

 Methods

The Kaggle survey data was cleaned, and then one-hot encoded for each
developer tool based on survey responses. Tools used by fewer than
10% of respondents were removed from the dataset. We then applied
UMAP, with the number of nearest neighbors set to 32, to define
clusters of users; six distinct clusters were found and translated
into cohorts, with no apparent correlation to self-assigned role
title.
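
The encoding and pruning steps described above can be sketched as
follows. This is a minimal illustration in plain Python, not the code
used for the analysis: the function name `one_hot_encode`, the
`min_share` parameter, and the toy responses are all hypothetical,
and each respondent is assumed to be represented as a set of tool
names.

```python
from collections import Counter


def one_hot_encode(responses, min_share=0.10):
    """One-hot encode tool usage, dropping rarely used tools.

    responses: list of sets, one set of tool names per respondent
               (hypothetical input shape, not the raw survey format).
    min_share: tools used by a smaller fraction of respondents than
               this are removed (10% in the text above).
    Returns (kept_tools, matrix), where matrix has one 0/1 row per
    respondent and one column per kept tool.
    """
    n = len(responses)
    # Count how many respondents mentioned each tool.
    counts = Counter(tool for tools in responses for tool in tools)
    # Keep only tools that meet the usage threshold.
    kept = sorted(t for t, c in counts.items() if c / n >= min_share)
    matrix = [[1 if t in tools else 0 for t in kept] for tools in responses]
    return kept, matrix


# Toy data: 11 made-up respondents. MATLAB is used by 1 of 11 (about
# 9%), so it falls below the 10% threshold and is dropped.
responses = (
    [{"Jupyter", "scikit-learn"}] * 5
    + [{"Jupyter", "VS Code"}] * 3
    + [{"RStudio"}] * 2
    + [{"Jupyter", "MATLAB"}]
)

kept_tools, matrix = one_hot_encode(responses)
print(kept_tools)
print(matrix[0])
```

A 0/1 matrix of this shape is the kind of input a UMAP implementation
would accept, with its neighbor count set to 32 as described above.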

[clusters]

Clusters were validated with the qualitative data in the Anaconda
Data Science Survey, as well as with blog and social media posts;
StackOverflow issues; and GitHub issues (for example: ML Ops
engineers tend to have backgrounds that fall more commonly on the
"software engineering" side of the spectrum).

 Results

Primary findings from the aggregated survey responses can be found
below. The numbers adjacent to each bullet point indicate which
surveys in the Data Sources table (1 through 5, in the order listed)
support each assessment.

 Demographics

  * Many of the survey respondents do not have a computer science
    background, but have been trained in some other domain (physical,
    natural or biological sciences; statistics; etc.) -- often
    obtaining a graduate or professional degree. [1,2,3,4]

  * Most survey respondents have been programming for less than a
    decade, and have less than three years of experience with machine
    learning or software engineering. [1,4]

  * The majority of survey respondents work in small teams (0-5
    engineers), or in large communities of practice (20+ engineers).
    These data scientists are not likely to be using version control
    systems; but do often indicate using GitHub as both a place to
    find code for their experiments, and a way to showcase their
    work. [1,4,5]

  * Most survey respondents appear to be in their late 20s or early
    30s, with 60% between 22 and 34. Only 20% are above the age of
    40; and there are signs of the numbers skewing even younger, as
    Generation Z becomes more involved with data science and machine
    learning work. Nearly 7% of Kaggle survey data scientists are
    aged 18-21, an increase from 5% in 2019. [1,2,4,5]

 Tools

  * Jupyter products (JupyterLab and original Jupyter notebooks) are
    the overwhelming winner in terms of IDE use (74.1%), with VS
    Code, PyCharm, and RStudio neck and neck for second place (all
    around 32%). It is common for survey respondents to use more than
    one development environment. [1,2,3,4,5]

  * Survey respondents prefer having a quick scratchpad that is
    automatically connected to data sources and does not require
    manual authentication. Though free-tier hosted notebooks (ex:
    Colab, Binder) are used by a subset of respondents, hosting and
    sharing code externally is not a P0. [1,4]

  * Data scientists overwhelmingly use open-source tools, not
    proprietary tools. [1,2,3,4,5]

 Data

  * Most survey respondents are using small to medium-sized data sets
    that can fit in memory. [1,4]

  * These data sets are usually composed of local flat files (CSVs,
    JSON, etc.), or tables exported from relational databases. Data
    lakes and non-SQL databases are rarely used, if ever. [1,4]

  * Preferred databases are primarily open-source (PostgreSQL, MySQL,
    SQLite, etc.), though a significant number of users are opting
    for Microsoft SQL Server. [1]

  * Exploratory data analysis is a significant component of both data
    science and machine learning work; and is usually done using
    open-source libraries. Please note: EDA is distinct from, and a
    precursor to, ETL pipeline-building. [1,4]

  * Little to no exploratory data analysis is being done using large
    clusters of machines (Spark, Dask). If these tools are used, it
    is most commonly by the cohorts that are described in the table
    below as ML Ops professionals, data engineers, or deep learning
    engineers. [1,4]

 Algorithms and Methods

  * Most survey respondents are doing either exploratory data
    analysis, or traditional machine learning with scikit-learn.
    These models are most commonly logistic and linear regression;
    random forests and decision trees; Bayesian methods; and gradient
    boosted trees. [1,3,4]

  * It is common for data scientists to use more than one language -
    and the usual suspects are R, Python, and SQL. [1,2,3,4,5]

  * Most survey respondents are not using automated machine learning
    (AutoML) techniques, or experiment management and model
    orchestration tools (ex: Weights & Biases, MLflow). [1,4]

 Production and Cloud

  * Most survey respondents are not yet using machine learning in
    production, though the share that is doing so increased year
    over year (28.9% in 2019 compared to 30.8% in 2020). [1,4,5]

  * Most survey respondents are not yet using self-hosted cloud
    technologies, though they often leverage third-party hosted
    notebooks (ex: Colab, Binder). [1]

Segmenting out Kaggle survey respondents who indicated spending more
than $100K (n = 729) on cloud resources, we find that:

  * A substantial number of respondents use Power BI and Tableau
    (22% and 30%, respectively).

  * Azure jumps to second place (31%) for survey respondents who
    indicate that they spend more than $100K on cloud resources. For
    survey respondents in aggregate, the second most popular cloud is
    GCP. The most popular cloud for both segments is AWS.

  * Large-scale cloud customers have even less of a focus on deep
    learning; if they are using machine learning at all, they are
    using traditional models.

  * Only half of large-scale cloud compute users (49%) are using GPUs
    - and, even for those users, those GPUs are local.

  * Survey respondents who indicated spending $100K or more on cloud
    resources were more likely to be tenured employees (5+ years of
    experience).

  * The other data points still hold: VS Code is in a distant second
    place to Jupyter* as an IDE; flat files and relational databases
    are still the most common data sources; most teams don't have
    machine learning models running in production and are still
    exploring; etc.
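
The segment comparison above amounts to filtering respondents by
their reported cloud spend and recomputing tool shares within that
subset. A minimal sketch in plain Python is shown below; the record
layout, the `spend_bucket` labels, the `tool_share` helper, and the
four toy respondents are hypothetical and do not reproduce the survey
figures.

```python
def tool_share(respondents, tool):
    """Fraction of the given respondents who reported using `tool`."""
    if not respondents:
        return 0.0
    return sum(tool in r["tools"] for r in respondents) / len(respondents)


# Toy records (made up): a spend bucket plus the set of tools reported.
respondents = [
    {"spend_bucket": "100k_plus", "tools": {"AWS", "Tableau"}},
    {"spend_bucket": "100k_plus", "tools": {"Azure", "Power BI"}},
    {"spend_bucket": "under_100k", "tools": {"GCP"}},
    {"spend_bucket": "under_100k", "tools": {"AWS"}},
]

# Segment out the large-scale cloud spenders.
big_spenders = [r for r in respondents if r["spend_bucket"] == "100k_plus"]

# Compare a tool's share inside the segment with its share overall.
for tool in ("AWS", "Azure", "GCP", "Tableau"):
    print(tool, tool_share(big_spenders, tool), tool_share(respondents, tool))
```

Because respondents can select several tools, shares computed this
way are not expected to sum to 100% within a segment.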

 Machine Learning User Cohorts

The survey data described above has been used to create a customer
cohort table (distilled view below).

Please note: this table is not meant to be a comprehensive assessment
of each of these cohort groups and the tools they use; it is just a
brief overview. Additional blog posts with a deep dive for each group
are to follow.

Cohort           Description                          Most Common IDEs

Beginners (new   New to programming, data science,    Hosted
to programming)  and machine learning. Canonical      notebooks (ex:
                 example would be high school and     Colab) or
                 college students. Primary mechanism  Jupyter
                 to learn is video content (Coursera, notebooks
                 YouTube, EdX, etc.).

Beginners        New to data science and ML, and just PyCharm, VS
(current         beginning to learn. Most commonly    Code, Visual
software         come from an app development         Studio, other
engineers, new   background.                          software IDE
to ML)

Data Analyst     Use data to help understand business Excel or Google
                 problems or research questions.      Sheets
                 Minimal (if any) statistics
                 background.

Data Scientist - Use data to help understand          Jupyter/
Business, EDA    business, logistics, or supply chain JupyterLab,
                 problems.                            RStudio

(Data) Scientist Use data to help understand problems MATLAB, Jupyter
- Academic, EDA  in the physical, biological, social, /JupyterLab,
                 or natural sciences.                 RStudio

Data Scientist - Just beginning to use machine        Jupyter/
Business,        learning methods to solve business   JupyterLab,
Traditional ML   problems, and to complement          RStudio
                 exploratory data analysis
                 techniques.

ML Researcher -  Just beginning to use machine        Jupyter/
Academic,        learning methods to solve research   JupyterLab,
Traditional ML   questions, and to complement         MATLAB, RStudio
                 exploratory data analysis
                 techniques.

Deep Learning    Similar to the traditional machine   Jupyter/
Researcher       learning segment; most comfortable   JupyterLab,
(small-scale)    with medium-sized data sets and      PyCharm
                 local machines.

ML Framework     Authors of scikit-learn, Keras,      Jupyter/
Builder, or      PyTorch Lightning, and other similar JupyterLab,
High-Level Deep  tools.                               PyCharm or VS
Learning API                                          Code
Builder

Deep Learning    The NeurIPS, ICML, and ICLR          Jupyter/
Engineer         contingent; these are the            JupyterLab,
(large-scale)    researchers you would expect to see  PyCharm or VS
                 hired at OpenAI, Google Brain, etc.  Code

Deep Learning    Authors of low-level APIs for        PyCharm or VS
Framework        TensorFlow, JAX, and PyTorch;        Code
Builder          distributed training frameworks,
                 like Ray; and similar.

ML Ops           The engineers who productionize ML   Visual Studio,
Practitioner     systems; responsible for running,    VS Code,
                 maintaining, and debugging ML        PyCharm (or
                 pipelines (from ETL through          other JetBrains
                 deployment). Usually do not have a   tools)
                 background in machine learning.
