https://www.slideshare.net/GaelVaroquaux/dirty-data-science-machine-learning-on-noncurated-data
Dirty data science: machine learning on non-curated data
Gael Varoquaux
Engineering. Oct. 26, 2021. 10,447 views.

These slides are a one-hour course on machine learning with non-curated data. According to industry surveys, the number one hassle of data scientists is cleaning the data to analyze it. Here, I survey what "dirtyness" forces time-consuming cleaning. We will then cover two specific aspects of dirty data: non-normalized entries and missing values. I show how, for these two problems, machine-learning practice can be adapted to work directly on a data table without curation. The normalization problem can be tackled by adapting methods from natural language processing. The missing-values problem will lead us to revisit classic statistical results in the setting of supervised learning.

Gael Varoquaux, Researcher at INRIA
1.
Dirty data science: machine learning on non-curated data
Gael Varoquaux
2.
Dirty data science: machine learning on non-curated data
Gael Varoquaux
3.
Industry challenges to data science
www.kaggle.com/ash316/novice-to-grandmaster
4.
Industry challenges to data science
www.kaggle.com/ash316/novice-to-grandmaster
On some dirty-data problems, progress in machine learning can ease the pain
5.
Talk outline
1 What models cannot fit
2 Learning with missing values
3 Machine learning on dirty categories
6.
1 What models cannot fit
Outside of statistics' comfort zone ($X \in \mathbb{R}^{n \times p}$)
7.
1 The full life-cycle of a data-science project
Framing the domain question
Finding and understanding the data
Assembling and reshaping it
Designing an AI / statistical model *
Evaluating model performance
Inspecting the model for unwanted behavior
Bringing the model to stakeholders / production
*: what we think is cool
8.
1 Understanding the data, between human and machine
Age: 60, 26, 38, 139, 52, 86, 17, 48
Just numbers
9.
1 Understanding the data, between human and machine
Age: 60, 26, 38, ?? 139, 52, 86, 17, 48
Numbers with a meaning
A numerical column expresses a quantity, with a corresponding scale...
10.
1 Understanding the data, between human and machine
Age  Name
60   Bono
26   Justin Bieber
38   Giselle Knowles-Carter*
139  Pablo Picasso
52   Celine Dion
86   Leonard Cohen
17   Greta Thunberg
48   Justin Trudeau
* Beyonce
A numerical column expresses a quantity, with a corresponding scale...
Recognized entries shed light on the numbers
11.
1 Understanding the data, between human and machine
Age  Name                     Born in  Activity
60   Bono                     Ireland  Singer
26   Justin Bieber            Canada   Singer
38   Giselle Knowles-Carter*  USA      Singer
139  Pablo Picasso            Spain    Painter
52   Celine Dion              Canada   Singer
86   Leonard Cohen            Canada   Singer
17   Greta Thunberg           Sweden   Activist
48   Justin Trudeau           Sweden   Politician
* Beyonce
A numerical column expresses a quantity, with a corresponding scale...
Recognized entries shed light on the numbers
They can be used to bring in additional information (features)
12.
1 Understanding the data, between human and machine
Age  Name                     Born in  Activity
60   Bono                     Ireland  Singer
26   Justin Bieber            Canada   Singer
38   Giselle Knowles-Carter*  USA      Singer
139  Pablo Picasso            Spain    Painter
52   Celine Dion              Canada   Singer
86   Leonard Cohen            Canada   Singer
17   Greta Thunberg           Sweden   Activist
48   Justin Trudeau           Sweden   Politician
* Beyonce
A numerical column expresses a quantity, with a corresponding scale...
Recognized entries shed light on the numbers
They can be used to bring in additional information (features)
And find errors
Knowledge representation, relational algebra
13.
1 Assembling data, of different natures and sources
Age  Name          Position
60   John Doe      Electrician
48   Jane Austen   Senior Professor
52   Jack Daniels  Professor

Position          Salary
Electrician       35 lizards
Professor         13 horses
Senior Professor  1 dragon

To model the link between age and salary, a join is necessary
Databases: to maintain consistency and minimize storage, data are normalized: multiple tables are used to minimize redundancy.
Statistics: needs samples and features: multiple observations of the same kind = data is denormalized in 1 table

Age  Name         Position          Salary      Coffees/day
60   John Doe     Electrician       35 lizards  2
48   Jane Austen  Senior Professor  1 dragon    128
14.
1 Aggregations - long vs wide tables
Person ID  Measure type    Value
12345      Blood Pressure  139
45673      Sugar Level     113
12345      Heart Rate      71
45673      Blood Pressure  84
Long table: flexible data representation

Person ID  Blood Pressure  Sugar Level  Heart Rate
12345      139             NA           71
45673      84              113          NA
Wide table: amenable to statistics on Person

Long to wide in Pandas: unstack, pivot
Also: count coffees per day per person from coffee-machine logs
15.
1 Data wrangling: assembling unfamiliar sources
Relational algebra:
  joins
  aggregations (# coffees a day)
  selections (finding the data)
Challenges:
  understanding the data store and domain logic
  errors in the data (correspondences in names)

Age  Name            Country  Position        Coffees/day
48   Justin Trudeau  Canada   Prime minister  3000
NA   Gael Varoquaux  NA       NA              NA
16.
1 Data wrangling: assembling unfamiliar sources
Relational algebra:
  joins
  aggregations (# coffees a day)
  selections (finding the data)
Challenges:
  understanding the data store and domain logic
  errors in the data (correspondences in names)
In health: assembling information across large electronic health records systems
17.
1 Systematic errors: data require external checks
Measurement biases:
  Volunteer bias: more women volunteer in medical studies
18.
1 Systematic errors: data require external checks
Measurement biases:
  Volunteer bias: more women volunteer in medical studies
  Selection bias: healthy people seldom go to the hospital (causal inference)
19.
1 Systematic errors: data require external checks
Measurement biases:
  Volunteer bias: more women volunteer in medical studies
  Selection bias: healthy people seldom go to the hospital (causal inference)
  Survival bias: data loss related to the process under study (survival models)
20.
1 Systematic errors: data require external checks
Measurement biases:
  Volunteer bias: more women volunteer in medical studies
  Selection bias: healthy people seldom go to the hospital (causal inference)
  Survival bias: data loss related to the process under study (survival models)
Partly addressed by machine-learning models for dataset shift (transfer learning) if you know the bias.
Brings us back to understanding the data
21.
Data-science is much more than fitting a statistical model
Data require assembling information
Different data sources = different conventions
Measurements come with errors and biases
These challenges require domain knowledge and data wrangling
22.
2 Learning with missing values [Josse... 2019]
Gender  Date Hired  Employee Position Title
M       09/12/1988  Master Police Officer
F       NA          Social Worker IV
M       07/16/2007  Police Officer III
F       02/05/2007  Police Aide
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       NA          Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       NA          Library Assistant I
23.
Why doesn't the #$@! machine learning toolkit work?!
Machine learning models need entries in a vector space (or at least a metric space).
$\mathrm{NA} \notin \mathbb{R}$
More than an implementation problem
24.
Why doesn't the #$@! machine learning toolkit work?!
Machine learning models need entries in a vector space (or at least a metric space).
$\mathrm{NA} \notin \mathbb{R}$
More than an implementation problem
Categorical entries are discrete anyhow
For missing values in categorical variables, create a special category "missing".
Rest of talk on NA in numerical variables
25.
2 Classic statistics points of view
Model a) a distribution $f_\theta$ for the complete data $x$
Model b) a random process $g_\phi$ occluding entries (mask $m$)
Missing at random situation (MAR): for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019]:
$\mathrm{observed}(x', m_i) = \mathrm{observed}(x_i, m_i) \Rightarrow g_\phi(m_i \mid x') = g_\phi(m_i \mid x_i)$
Theorem [Rubin 1976]: in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a).
26.
2 Classic statistics points of view
Model a) a distribution $f_\theta$ for the complete data $x$
Model b) a random process $g_\phi$ occluding entries (mask $m$)
Missing at random situation (MAR): for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019]
Theorem [Rubin 1976]: in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a).
Missing completely at random situation (MCAR): missingness is independent from data
Missing not at random situation (MNAR): missingness not ignorable
27.
2 Classic statistics points of view
Model a) a distribution $f_\theta$ for the complete data $x$
Model b) a random process $g_\phi$ occluding entries (mask $m$)
Missing at random situation (MAR): for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019]
Theorem [Rubin 1976]: in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a).
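The three missingness mechanisms named on these slides (MCAR, MAR, MNAR) can be illustrated with a small simulation. This is a minimal sketch and not taken from the slides: the variable names, the 0.3 missing rate, the logistic MAR mask, and the censoring threshold are all illustrative assumptions. It shows that the plain mean of the observed values is only trustworthy under MCAR, and that under MAR a model of the masked variable given the observed one still recovers the complete-data mean.

```python
import math
import random
from statistics import fmean

random.seed(0)
n = 50_000

# Complete data: x1 is always observed; x2 is correlated with x1, true mean 0.
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0.0, 1.0) for a in x1]

# One mask per mechanism (True = x2 is missing). All settings are illustrative.
mcar = [random.random() < 0.3 for _ in range(n)]  # independent of the data
mar = [random.random() < 1.0 / (1.0 + math.exp(-2.0 * a)) for a in x1]  # depends on observed x1 only
mnar = [b > 1.0 for b in x2]  # depends on the hidden value itself (censoring)


def observed_mean(values, mask):
    """Mean over the entries that are not masked."""
    return fmean(v for v, m in zip(values, mask) if not m)


mean_mcar = observed_mean(x2, mcar)
mean_mar = observed_mean(x2, mar)
mean_mnar = observed_mean(x2, mnar)

# Under MAR, fit x2 ~ x1 by least squares on the observed rows only,
# then average the predictions over every row.
obs = [(a, b) for a, b, m in zip(x1, x2, mar) if not m]
mean_a = fmean(a for a, _ in obs)
mean_b = fmean(b for _, b in obs)
slope = sum((a - mean_a) * (b - mean_b) for a, b in obs) / sum(
    (a - mean_a) ** 2 for a, _ in obs
)
intercept = mean_b - slope * mean_a
mean_mar_model = fmean(slope * a + intercept for a in x1)

print(f"MCAR, observed mean:        {mean_mcar:+.3f}")
print(f"MAR,  observed mean:        {mean_mar:+.3f}")
print(f"MAR,  model-based estimate: {mean_mar_model:+.3f}")
print(f"MNAR, observed mean:        {mean_mnar:+.3f}")
```

In this toy setting the MNAR bias cannot be removed from the observed values alone, which is one way to read "missingness not ignorable".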
Missing completely at random situation (MCAR): missingness is independent from data
Missing not at random situation (MNAR): missingness not ignorable
[Figure: three panels: Complete, MCAR, MNAR]
28.
2 Classic statistics points of view
Model a) a distribution $f_\theta$ for the complete data $x$
Model b) a random process $g_\phi$ occluding entries (mask $m$)
Missing at random situation (MAR): for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019]
Theorem [Rubin 1976]: in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a).
Missing completely at random situation (MCAR): missingness is independent from data
Missing not at random situation (MNAR): missingness not ignorable
[Figure: three panels: Complete, MCAR, MNAR]
But:
  There isn't always an unobserved value (age of spouse of singles?)
  Machine-learning's goal is not to maximize likelihoods
29.
2 Imputation
Fill in information
Gender  Date Hired  Employee Position Title
M       09/12/1988  Master Police Officer
F       NA → 2000   Social Worker IV
M       07/16/2007  Police Officer III
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       NA → 2012   Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       NA → 2014   Library Assistant I
Large statistical literature
Procedures and results focused on in-sample settings
How about completing the test set with the train set?
What to do with the prediction target y?
30.
2 Imputation and prediction with test-time missing values
Settings: $y = f(x) + \varepsilon$
Theorem [Josse... 2019]
f: trained predictor achieving Bayes risk on full data
Conditional multiple imputation achieves Bayes risk on test set with missing data (in MAR settings):
$f^\star_{\text{mult imput}}(x) = \mathbb{E}_{x_m \mid X_o = x_o}[f(x_m, X_o)]$
Notations:
  $x \in (\mathbb{R} \cup \{\mathrm{NA}\})^p$: data at hand
  $x_o$: observed values
  $x_m$: unobserved values
31.
2 Imputation procedures that work out of sample
Mean imputation (special case of univariate imputation)
  Replace NA by the mean of the feature
  sklearn.impute.SimpleImputer
32.
2 Imputation procedures that work out of sample
Mean imputation (special case of univariate imputation)
  Replace NA by the mean of the feature
  sklearn.impute.SimpleImputer
Conditional imputation
  Modeling one feature as a function of others
  Possible implementation: iteratively predict one feature as a function of the others
  Classic implementations in R: MICE, missforest
  sklearn.impute.IterativeImputer
  bad computational scalability
33.
2 Imputation procedures that work out of sample
Mean imputation (special case of univariate imputation)
  Replace NA by the mean of the feature
  sklearn.impute.SimpleImputer
Conditional imputation
  Modeling one feature as a function of others
  Possible implementation: iteratively predict one feature as a function of the others
  Classic implementations in R: MICE, missforest
  sklearn.impute.IterativeImputer
  bad computational scalability
Classic statistics point of view
  Mean imputation is disastrous, because it distorts the distribution
  "Congeniality" conditions: good imputation must preserve data properties used by later analysis steps
34.
2 Constant imputation for supervised learning
Theorem [Josse... 2019]
For a powerful learner (universally consistent), imputing both train and test with the mean of train is consistent, i.e. it converges to the best possible prediction
Intuition: the learner "recognizes" imputed entries and compensates at test time
35.
2 Constant imputation for supervised learning
Theorem [Josse... 2019]
For a powerful learner (universally consistent), imputing both train and test with the mean of train is consistent, i.e. it converges to the best possible prediction
Intuition: the learner "recognizes" imputed entries and compensates at test time
Constant imputation breaks simple models (e.g. linear models) [Morvan... 2020]
36.
2 Imputation for supervised learning
Simulation: MCAR + Gradient boosting
[Figure: r2 score against sample size (10^2 to 10^4) for Mean and Iterative imputation; panels: Convergence, Small sample size]
Notebook: github - @nprost / supervised missing
Conclusions: IterativeImputer is useful for small sample sizes
37.
2 Imputation is not enough: predictive missingness
Pathological case [Josse... 2019]
  y depends only on whether data is missing or not, e.g. tax fraud detection
  theory: MNAR = "Missing Not At Random"
Imputing makes prediction impossible
Solution
  Add a missingness indicator: extra feature to predict
  ...SimpleImputer(add_indicator=True)
  ...IterativeImputer(add_indicator=True)
38.
2 Imputation is not enough: predictive missingness
Pathological case [Josse... 2019]
  y depends only on whether data is missing or not, e.g. tax fraud detection
  theory: MNAR = "Missing Not At Random"
Imputing makes prediction impossible
Solution
  Add a missingness indicator: extra feature to predict
  ...SimpleImputer(add_indicator=True)
  ...IterativeImputer(add_indicator=True)
Simulation: y depends indirectly on missingness (censoring)
[Figure: r2 score against sample size (10^2 to 10^4) for Mean, Mean + indicator, Iterative, Iterative + indicator; panels: Convergence, Small sample size]
Notebook: github - @nprost / supervised missing
Adding a mask is crucial
Iterative imputation can be detrimental
39.
2 Tree models with missing values
MIA (Missing Incorporated Attribute) [Josse... 2019]
Tree: x10 < -1.5 ? x2 < 2 ? Yes/Missing x7 < 0.3 ? No ...
Yes ... No/Missing x1< 0.5 ? Yes ... No/Missing ... Predict +1.3 sklearn.ensemble.HistGradientBoostingClassifier The learner readily handles missing values G Varoquaux 21 40. 40. 2 Tree models with missing values (MCAR) Simulation: MCAR + Gradient boosting 102 103 104 Sample size 0.70 0.75 0.80 r2 score Inside trees Mean Iterative Convergence 0.75 0.80 r2 score Iterative Mean Inside trees Small small size Notebook: github - @nprost / supervised missing G Varoquaux 22 41. 41. 2 Tree models with missing values (censored) Simulation: MCAR + Gradient boosting 102 103 104 Sample size 0.7 0.8 0.9 r2 score Inside trees Mean Iterative Mean+ indicator Iterative+ indicator Convergence 0.8 0.9 r2 score Iterative+ indicator Mean+ indicator Iterative Mean Inside trees Small small size Notebook: github - @nprost / supervised missing G Varoquaux 23 42. 42. 2 Neural networks with missing values Gradient-based optimization of continuous models Difficulty: Half-discrete input space (NA [?] R) Y = b? 1X1 + b? 2X2 + b? 0 cor(X1, X2) = 0.5. If X2 is missing, the coefficient of X1 should compensate for the missingness of X2. up to 2d set of slopes effect of X2lost effect of X2 accounted for by X1 G Varoquaux 24 43. 43. 2 Neumiss network: adapted neural architecture [Le Morvan... 2020] Neural networks that approximate optimal predictors (functions of S-1 ). Taylored architecture which learns all slopes jointly G Varoquaux 25 44. 44. 2 Neumiss network: adapted neural architecture [Le Morvan... 2020] Neural networks that approximate optimal predictors (functions of S-1 ). Taylored architecture which learns all slopes jointly 103 104 Number of parameters 0.00 -0.05 -0.10 R2 score - Bayes rate MLP Deep MLP Wide NeuMiss Test set Train set Network depth 1 3 5 7 9 width 1 d 3 d 10 d 30 d 50 d NeuMiss needs less data G Varoquaux 25 45. 45. 2 Neumiss network: adapted neural architecture [Le Morvan... 2020] Neural networks that approximate optimal predictors (functions of S-1 ). 
Taylored architecture which learns all slopes jointly 103 104 Number of parameters 0.00 -0.05 -0.10 R2 score - Bayes rate MLP Deep MLP Wide NeuMiss Test set Train set Network depth 1 3 5 7 9 width 1 d 3 d 10 d 30 d 50 d NeuMiss needs less data Also suitable for MNAR settings G Varoquaux 25 46. 46. Learning with missing values Imputation is motivated only in MAR settings Rather than a sophisticated imputation, use a powerful supervised learner sklearn's HistGradientBoostingClassifier readily models missing values Can work in MNAR settings Different regime as standard statistics G Varoquaux 26 47. 47. 3 Machine learning on dirty categories [Cerda... 2018, Cerda and Varoquaux 2020] Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I G Varoquaux 27 48. 48. 3 Categorical entries in a statistical model Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I Master Police Officer Social Worker IV Police Officer II 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 One-hot encoding X [?] Rnxp G Varoquaux 28 49. 49. 3 Non-normalized categorical entries in a statistical model Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I Break OneHotEncoder Overlapping categories "Master Police Officer", "Police Officer III", "Police Officer II"... High cardinality 400 unique entries in 10 000 rows Rare categories Only 1 "Architect III" New categories in test set G Varoquaux 29 50. 50. 
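The ways in which non-normalized entries break one-hot encoding can be reproduced on a toy column. The sketch below is an illustration added to this transcript, not code from the talk. It assumes scikit-learn's OneHotEncoder with handle_unknown="ignore", and the five training entries are a hypothetical sample built from job titles that appear on the slides.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training column, reusing job titles from the slides.
train = np.array([
    ["Master Police Officer"],
    ["Police Officer III"],
    ["Police Officer II"],
    ["Bus Operator"],
    ["Bus Operator"],
])
# Test column containing one category never seen during training.
test = np.array([["Bus Operator"], ["Architect III"]])

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train = encoder.fit_transform(train).toarray()
X_test = encoder.transform(test).toarray()

# One column per distinct string: dimension grows with cardinality.
n_columns = X_train.shape[1]
# Overlapping titles share no column, so their rows are orthogonal.
overlap = float(X_train[1] @ X_train[2])
# The unseen test category carries no information at all.
unseen_row_sum = float(X_test[1].sum())
print(n_columns, overlap, unseen_row_sum)
```

"Police Officer III" and "Police Officer II" end up as unrelated columns, and "Architect III" is encoded as a row of zeros. These are the overlapping-category and new-category problems named on the slide.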
3 Forgotten baseline: TargetEncoder [Micci-Barreca 2001]
High-cardinality categories: represent each category by the average target y.
Police Officer II -> average salary of Police Officer II
[Figure: average employee salary y (40000 to 140000) per category: Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I, Administrative Specialist II, Management and Budget Specialist III, Manager III, Manager I, Manager II]
Embedding categories with the same y close together can help build a simple decision function.
DirtyCat: dirty category software: http://dirty-cat.github.io

    from dirty_cat import TargetEncoder
    target_encoder = TargetEncoder()
    # y is the prediction target (here, the employee salary)
    transformed_values = target_encoder.fit_transform(df, y)

3 Data curation
Database normalization / feature engineering:
Employee Position Title   =>  Position         Rank
Master Police Officer         Police Officer   Master
Social Worker III             Social Worker    III
Police Officer II             Police Officer   II
Social Worker II              Social Worker    II
Police Officer III            Police Officer   III
Merging entities: deduplication & record linkage. Output a "clean" database.
Company name: Pfizer Inc.; Pfizer Pharmaceuticals LLC; Pfizer International LLC; Pfizer Limited; Pfizer Corporation Hong Kong Limited; Pfizer Pharmaceuticals Korea Limited; ...
Difficult without supervision. Potentially suboptimal: Pfizer Corporation Hong Kong = ? Pfizer Pharmaceuticals Korea
Hard to make automatic and turn-key. Harder than supervised learning.

Our goal: supervised learning on dirty categories
The statistical question should inform curation.
Pfizer Corporation Hong Kong = ? Pfizer Pharmaceuticals Korea

3 Adding similarities to one-hot encoding
One-hot encoding:
          London  Londres  Paris
Londres   0       1        0
London    1       0        0
Paris     0       0        1
$X \in \mathbb{R}^{n \times p}$. New categories? Link categories?
Similarity encoding [Cerda... 2018]:
          London  Londres  Paris
Londres   0.3     1.0      0.0
London    1.0     0.3      0.0
Paris     0.0     0.0      1.0
The off-diagonal entry 0.3 is string_distance(Londres, London).

3 Some string similarities
Levenshtein: number of edits on one string to match the other.
Jaro-Winkler: $d_{\text{jaro}}(s_1, s_2) = \frac{m}{3|s_1|} + \frac{m}{3|s_2|} + \frac{m - t}{3m}$, with $m$ the number of matching characters and $t$ the number of character transpositions.
n-gram similarity: an n-gram is a group of n consecutive characters (e.g. "London" contains the 3-grams "Lon", "ond", "ndo", ...).
$\text{similarity} = \frac{\#\,\text{n-grams in common}}{\#\,\text{n-grams in total}}$
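The n-gram similarity above fits in a few lines of plain Python. This is a minimal sketch added for illustration, not the DirtyCat implementation: it assumes that "in common" and "in total" refer to the sets of distinct 3-grams of the two strings (intersection over union), with no padding or case folding. The helper names char_ngrams and ngram_similarity are hypothetical.

```python
def char_ngrams(string, n=3):
    """Return the set of groups of n consecutive characters in `string`."""
    return {string[i:i + n] for i in range(len(string) - n + 1)}


def ngram_similarity(s1, s2, n=3):
    """Number of n-grams in common divided by the number of n-grams in total."""
    grams1, grams2 = char_ngrams(s1, n), char_ngrams(s2, n)
    total = grams1 | grams2
    if not total:  # both strings are shorter than n characters
        return 1.0 if s1 == s2 else 0.0
    return len(grams1 & grams2) / len(total)


categories = ["London", "Londres", "Paris"]
# Similarity encoding: one row per entry, one column per known category.
encoding = [[ngram_similarity(a, b) for b in categories] for a in categories]
for name, row in zip(categories, encoding):
    print(name, [round(value, 2) for value in row])
```

Under this definition, "London" and "Londres" share 2 of their 7 distinct 3-grams ("Lon" and "ond"), while "Paris" shares none with either. The encoding therefore keeps the two spellings close without merging them.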
3 Python implementation: DirtyCat
DirtyCat: dirty category software: http://dirty-cat.github.io

    from dirty_cat import SimilarityEncoder
    similarity_encoder = SimilarityEncoder(similarity='ngram')
    transformed_values = similarity_encoder.fit_transform(df)

3 Dirty categories blow up dimension
New words in natural language.
$X \in \mathbb{R}^{n \times p}$, p is large:
- Statistical problems
- Computational problems

3 Tackling the high cardinality
Similarity encoding, one-hot encoding = prototype methods.
How to choose a small number of prototypes?
- All of the training set => huge dimensionality
- Most frequent?
- Maybe the right prototypes $\notin$ training set: "big cat", "fat cat", "big dog", "fat dog"
Estimate prototypes.

3 Substring information
Drug Name: alcohol; ethyl alcohol; isopropyl alcohol; polyvinyl alcohol; isopropyl alcohol swab; 62% ethyl alcohol; alcohol 68%; alcohol denat; benzyl alcohol; dehydrated alcohol
Employee Position Title: Police Aide; Master Police Officer; Mechanic Technician II; Police Officer III; Senior Architect; Senior Engineer Technician; Social Worker III

3 Modeling substrings [Cerda and Varoquaux 2020]
Model on sub-strings (GaP: Gamma-Poisson factorization).
Models strings as a combination of substrings.
[Figure: binary matrix marking which 3-grams (pol, lic, ice, ce_, _of, off, fic, cer, er_) occur in the entries "police officer", "pol off", "polis", "policeman", "policier"]
sklearn.feature_extraction.text:
- CountVectorizer, analyzer: 'word', 'char', 'char_wb'
- HashingVectorizer: fast, stateless
- TfidfVectorizer: normalize counts

3 Latent category model [Cerda and Varoquaux 2020]
Topic model on sub-strings (GaP: Gamma-Poisson factorization).
Models strings as a linear combination of substrings.
[Figure: the entries x 3-grams count matrix for the same entries, factorized as the product of a documents x topics matrix and a topics x 3-grams matrix]
- What substrings are in a latent category
- What latent categories are in an entry

3 String models of latent categories [Cerda and Varoquaux 2020]
Encodings that extract latent categories.
[Figure: activations of the latent categories (column labels truncated in the transcript) for the entries Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant]
3 String models of latent categories [Cerda and Varoquaux 2020]
Inferring plausible feature names.
[Figure: activations of the latent categories, with columns labeled by inferred feature names (truncated in the transcript: "...stant, library", "...ment, operator", "...on, specialist", "...ker, warehouse", "...ogram, manager", "...nic, community", "...escuer, rescue", "...ction, officer"), for the entries Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant]

3 Data science with dirty categories
[Figure: permutation importances (0.0 to 0.2) of the inferred feature names: "Information, Technology, Technologist"; "Officer, Office, Police"; "Liquor, Clerk, Store"; "School, Health, Room"; "Environmental, Telephone, Capital"; "Lieutenant, Captain, Chief"; "Income, Assistance, Compliance"; "Manager, Management, Property"]

Learning does not require clean entities
- Model continuous similarities across entries
- Sub-string models can capture these
- Requires a powerful statistical model (gradient-boosted trees)
- Explainable machine-learning techniques give insight

@GaelVaroquaux Machine learning with dirty data
What models cannot fit; dirty categories; missing values.
Understanding and formatting data is unavoidable: master these aspects.
Powerful machine-learning models can cope with dirtiness:
- If it is well represented (representing similarities and missingness)
- If they have supervision information

4 References
P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
P. Cerda, G. Varoquaux, and B. Kegl. Similarity encoding for learning with dirty categorical variables. Machine Learning, 2018.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss networks: differentiable programming for supervised learning with missing values. In Advances in Neural Information Processing Systems 33, 2020.
M. Le Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear predictor on linearly-generated data with missing values: non consistency and solutions. AISTATS, 2020.
D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1):27-32, 2001.
D. B. Rubin. Inference and missing data. Biometrika, 63(3):581-592, 1976.
These slides are a one-hour course on machine learning with non-curated data. According to industry surveys, the number one hassle of data scientists is cleaning the data to analyze it. Here, I survey what "dirtyness" forces time-consuming cleaning. We will then cover two specific aspects of dirty data: non-normalized entries and missing values. I show how, for these two problems, machine-learning practice can be adapted to work directly on a data table without curation. The normalization problem can be tackled by adapting methods from natural language processing. The missing-values problem will lead us to revisit classic statistical results in the setting of supervised learning.