https://github.com/dynamicwebpaige/kaggle-survey-spelunking/blob/main/README.md
Latest commit 597acd7, Jun 19, 2021

# Machine Learning Cohorts: A Synthesis

"Data Scientist", "Machine Learning Developer", "Deep Learning Engineer", "Data Engineer", "ML Ops Engineer", and "Data Analyst" are often overloaded role titles -- and not necessarily indicative of a user's day-to-day work, or the tools they are using to accomplish that work.

To better understand and characterize these diverse user segments, we can use tools, libraries, and frameworks referenced in the Kaggle: State of Machine Learning and Data Science 2020 Survey to cluster engineers into cohort groups. We can also loosely tie these cohorts to their anticipated cloud spend; identify typical tasks each user cohort is responsible for completing; assess compute and storage requirements for each user cohort; and estimate cohort size, based on survey responses.

## TL;DR

Survey respondents are overwhelmingly performing exploratory analysis using small- to medium-sized data sets stored as flat files, on local machines. Machine learning projects - if ML is being attempted at all - are in early stages, using traditional methods that are best suited for high-RAM CPU rather than GPU SKUs (ex: scikit-learn and clustering approaches).

Based on responses, data science teams trend small (0-5 engineers), with light rigor on SDLC best practices (ex: version control); and most data scientists come from non-CS backgrounds, with minimal programming experience. Preferred tools are overwhelmingly open-source and non-proprietary.

If Visual Studio Code is being used by survey respondents, it is most often being used for non-interactive, production machine learning and data science work.

## Data Sources

The following surveys were included in the analysis:

| Survey | Description | Applicable # (out of total) | Raw Data Available? | Utility |
| --- | --- | --- | --- | --- |
| Kaggle State of Data Science and ML 2020 | Annual survey of 6-7M registered Kaggle users. Kaggle is the world's largest online community for machine learning and data science. | 20,036 | Y | |
| StackOverflow Developer Survey | Annual survey of StackOverflow users. Not domain-specific; ~8% of respondents indicated doing data-affiliated work (data science, ML, research). | 5,200 | Y | |
| Python Developers Survey 2020 | Annual survey of Python developers, completed in partnership between the PSF and JetBrains. Not data science and ML-specific, though ~50% of respondents indicate they use Python for EDA. | 15,400 | N | |
| Anaconda Data Science Survey | Focused survey for data analysts and ML engineers administered by Anaconda. Data not released publicly; but an executive summary is available. | 2,360 | N | |
| SlashData Survey | Not a domain-specific survey, and not segmented out by tools used. More than half of ML and data science respondents (5K) are hobbyists and students, and just learning how to do ML; not professionals. | 5,009 | N | |

Given the focused nature of the two survey instruments, the Anaconda survey and the Kaggle survey were both selected as the most useful for the purpose of this analysis. The data science and machine learning respondents from the Python Developers Survey (55% of total); the professional data science and machine learning respondents for the SlashData Survey (25% of total); and the data science respondents from the StackOverflow Developer Survey (8% of total) are used as supplemental evidence.

Though just under a third of total respondents for the Kaggle Survey and the PSF Survey indicated that they were using VS Code, this was found -- through qualitative interviews, as well as from social media scraping and GitHub issues analysis -- not to be for exploratory data analysis or interactive model building, but rather for machine learning model deployment; for other types of software development or Python library building; or for lightweight editing of Python and markdown files.

[Screen]

The data is available to view via GitHub's Flat Data, and to download from the Kaggle website.

## Methods

The Kaggle survey data was cleaned, and then one-hot encoded for each developer tool based on survey responses. Tools used by fewer than 10% of respondents were removed from the dataset. We then ran UMAP with the number of nearest neighbors set to 32 to define clusters of users; six distinct clusters were found and translated into cohorts, with no apparent correlation to self-assigned role title.

[clusters]

Clusters were validated with the qualitative data in the Anaconda Data Science Survey, as well as with blog and social media posts; StackOverflow issues; and GitHub issues (for example: ML Ops engineers tend to have backgrounds that fall more commonly on the "software engineering" side of the spectrum).

## Results

Primary findings from the aggregated survey responses can be found below. The numbers adjacent to each bullet point indicate which survey above (1 through 5, in the order listed in the Data Sources table) supports each assessment.

### Demographics

* Many of the survey respondents do not have a computer science background, but have been trained in some other domain (physical, natural or biological sciences; statistics; etc.) -- often obtaining a graduate or professional degree. [1,2,3,4]
* Most survey respondents have been programming for less than a decade, and have less than three years of experience with machine learning or software engineering.
[1,4]
* The majority of survey respondents work in small teams (0-5 engineers), or in large communities of practice (20+ engineers). These data scientists are not likely to be using version control systems; but do often indicate using GitHub as both a place to find code for their experiments, and a way to showcase their work. [1,4,5]
* Most survey respondents appear to be in their late 20s or early 30s, with 60% between 22 and 34. Only 20% are above the age of 40; and there are signs of the numbers skewing even younger, as Generation Z becomes more involved with data science and machine learning work. Nearly 7% of Kaggle survey data scientists are aged 18-21, an increase from 5% in 2019. [1,2,4,5]

### Tools

* Jupyter products (JupyterLab and original Jupyter notebooks) are the overwhelming winner in terms of IDE use (74.1%), with VS Code, PyCharm, and RStudio neck and neck for second place (all around 32%). It is common for survey respondents to use more than one development environment. [1,2,3,4,5]
* Survey respondents prefer having a quick scratchpad that is automatically connected to data sources and does not require manual authentication. Though free-tier hosted notebooks (ex: Colab, Binder) are used by a subset of respondents, hosting and sharing code externally is not a P0. [1,4]
* Data scientists overwhelmingly use open-source tools, not proprietary tools. [1,2,3,4,5]

### Data

* Most survey respondents are using small to medium-sized data sets that can fit in memory. [1,4]
* These data sets are usually composed of local flat files (CSVs, JSON, etc.), or tables exported from relational databases. Data lakes and non-SQL databases are rarely used, if ever. [1,4]
* Preferred databases are primarily open-source (PostgreSQL, MySQL, SQLite, etc.), though a significant number of users are opting for Microsoft SQL Server.
[1]
* Exploratory data analysis is a significant component of both data science and machine learning work; and is usually done using open-source libraries. Please note: EDA is distinct from and a precursor to ETL pipeline-building. [1,4]
* Little to no exploratory data analysis is being done using large clusters of machines (Spark, Dask). If these tools are used, it is most commonly by the cohorts that are described in the table below as ML Ops professionals, data engineers, or deep learning engineers. [1,4]

### Algorithms and Methods

* Most survey respondents are doing either exploratory data analysis, or traditional machine learning with scikit-learn. These models are most commonly logistic and linear regression; random forests and decision trees; Bayesian methods; and gradient boosted trees. [1,3,4]
* It is common for data scientists to use more than one language - and the usual suspects are R, Python, and SQL. [1,2,3,4,5]
* Most survey respondents are not using automated machine learning (AutoML) techniques, or experiment management and model orchestration tools (ex: Weights & Biases, MLflow). [1,4]

### Production and Cloud

* Most survey respondents are not yet using machine learning in production, though that number is steadily increasing year over year (28.9% in 2019 compared to 30.8% in 2020). [1,4,5]
* Most survey respondents are not yet using self-hosted cloud technologies, though they often leverage third-party hosted notebooks (ex: Colab, Binder). [1]

Segmenting out Kaggle survey respondents who indicated spending more than $100K on cloud resources (n = 729), we find that:

* There are a substantial number of respondents using Power BI and Tableau (22% and 30%, respectively).
* Azure jumps to second place (31%) for survey respondents who indicate that they spend more than $100K on cloud resources. For survey respondents in aggregate, the second most popular cloud is GCP. The most popular cloud for both segments is AWS.
* Large-scale cloud customers have even less of a focus on deep learning; if they are using machine learning at all, they are using traditional models.
* Only half of large-scale cloud compute users (49%) are using GPUs - and, even for those users, those GPUs are local.
* Survey respondents who indicated spending $100K or more on cloud resources were more likely to be tenured employees (5+ years of experience).
* The other data points still hold: VS Code is in a distant second place to Jupyter products as an IDE; flat files and relational databases are still the most common data sources; most teams don't have machine learning models running in production and are still exploring; etc.

## Machine Learning User Cohorts

The survey data described above has been used to create a customer cohort table (distilled view below). Please note: this table is not meant to be a comprehensive assessment of each of these cohort groups and the tools they used; just a brief overview. Additional blog posts with a deep dive for each group to follow.

| Cohort | Description | Most Common IDEs |
| --- | --- | --- |
| Beginners (new to programming) | New to programming, data science, and machine learning. Canonical example would be high school and college students. Primary mechanism to learn is video content (Coursera, YouTube, EdX, etc.). | Hosted notebooks (ex: Colab) or Jupyter notebooks |
| Beginners (current software engineers, new to ML) | New to data science and ML, and just beginning to learn. Most commonly come from an app development background. | PyCharm, VS Code, Visual Studio, other software IDE |
| Data Analyst | Use data to help understand business problems or research questions. Minimal (if any) statistics background. | Excel or Google Sheets |
| Data Scientist - Business, EDA | Use data to help understand business, logistics, or supply chain problems. | Jupyter/JupyterLab, RStudio |
| (Data) Scientist - Academic, EDA | Use data to help understand problems in the physical, biological, social, or natural sciences. | MATLAB, Jupyter/JupyterLab, RStudio |
| Data Scientist - Business, Traditional ML | Just beginning to use machine learning methods to solve business problems, and to complement exploratory data analysis techniques. | Jupyter/JupyterLab, RStudio |
| ML Researcher - Academic, Traditional ML | Just beginning to use machine learning methods to solve research questions, and to complement exploratory data analysis techniques. | Jupyter/JupyterLab, MATLAB, RStudio |
| Deep Learning Researcher (small-scale) | Similar to the traditional machine learning segment; most comfortable with medium-sized data sets and local machines. | Jupyter/JupyterLab, PyCharm |
| ML Framework Builder, or High-Level Deep Learning API Builder | Authors of scikit-learn, Keras, PyTorch Lightning, and other similar tools. | Jupyter/JupyterLab, PyCharm or VS Code |
| Deep Learning Engineer (large-scale) | The NeurIPS, ICML, and ICLR contingent; these are the researchers you would expect to see hired at OpenAI, Google Brain, etc. | Jupyter/JupyterLab, PyCharm or VS Code |
| Deep Learning Framework Builder | Authors of low-level APIs for TensorFlow, JAX, and PyTorch; distributed training frameworks, like Ray; and similar. | PyCharm or VS Code |
| ML Ops Practitioner | The engineers who productionize ML systems; responsible for running, maintaining, and debugging ML pipelines (from ETL through deployment). Usually do not have a background in machine learning. | Visual Studio, VS Code, PyCharm (or other JetBrains tools) |
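
As a rough illustration of the preprocessing described under Methods (one-hot encoding of tools, then dropping tools used by fewer than 10% of respondents), here is a minimal sketch. It is not the code used for this analysis: the function names, the toy `responses` data, and the input shape (one set of tool names per respondent) are all assumptions made for the example. The optional `embed` helper assumes the third-party `umap-learn` package and only produces a low-dimensional embedding; how clusters were read off that embedding is not stated above, so that step is left out.

```python
def one_hot_encode(responses):
    """Turn each respondent's set of tools into a 0/1 row, one column per tool."""
    tools = sorted({tool for selected in responses for tool in selected})
    matrix = [[1 if tool in selected else 0 for tool in tools]
              for selected in responses]
    return tools, matrix


def drop_rare_tools(tools, matrix, min_share=0.10):
    """Drop columns for tools used by fewer than `min_share` of respondents."""
    n = len(matrix)
    if n == 0:
        return tools, matrix
    keep = [j for j in range(len(tools))
            if sum(row[j] for row in matrix) / n >= min_share]
    kept_tools = [tools[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_tools, kept_matrix


def embed(matrix, n_neighbors=32):
    """Optional step: project the 0/1 matrix with UMAP (needs umap-learn)."""
    import umap  # third-party; imported lazily so the rest runs without it
    return umap.UMAP(n_neighbors=n_neighbors).fit_transform(matrix)


# Hypothetical toy data: 12 respondents, each with the tools they reported.
responses = (
    [{"Jupyter"}] * 6
    + [{"Jupyter", "VS Code"}] * 3
    + [{"RStudio", "VS Code"}] * 2
    + [{"Jupyter", "MATLAB"}]
)

tools, matrix = one_hot_encode(responses)
kept_tools, kept_matrix = drop_rare_tools(tools, matrix)
print(kept_tools)  # ['Jupyter', 'RStudio', 'VS Code'] (MATLAB: 1 of 12 respondents, under 10%)
```

On real survey data, `embed(kept_matrix)` would be the next step; it is not called here because the toy data set has fewer respondents than the 32 neighbors used in the analysis.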