[HN Gopher] Data Structures for Data-Intensive Applications [pdf...
___________________________________________________________________
Data Structures for Data-Intensive Applications [pdf] (2023)
Author : mfiguiere
Score : 496 points
Date : 2024-02-09 18:53 UTC (1 days ago)
(HTM) web link (cs-people.bu.edu)
(TXT) w3m dump (cs-people.bu.edu)
| Octokat wrote:
| Thank you for sharing this!
| datadrivenangel wrote:
| Data structures across distributed systems are challenging, so
| this is a useful resource.
|
| Still, better if you can keep your application on a single
| machine if that's feasible.
| teraflop wrote:
| This paper does not cover distributed systems, except for a
| very brief mention at the end.
| fithisux wrote:
| https://arxiv.org/abs/2001.04235
|
| here you are
| nwareing wrote:
| DSDIA ;)
| pgwhalen wrote:
| I've only had a chance to skim so far, but this is an absolutely
| superb survey of an enormous space. It doesn't just list out a
| bunch of data structures, it actully helps you build a mental
| model of all the concerns to think about when building or using
| data structures in an application.
| 3abiton wrote:
| That's the recap I needed!
| PH95VuimJjqBqy wrote:
| Easily one of the best technical books I've ever read.
| lgkk wrote:
| This is a good read
| russfink wrote:
| Needs a TOC
| albert_e wrote:
| I uploaded PDF and asked ChatGPT 4 for ToC ..it is really
| struggling to process this.
|
| Even after asking it to ignore page headers and footers.
|
| I thought the state of the art was much better?
| 082349872349872 wrote:
| By hand (5 min): I Introduction
| 2 II Performance Metrics and Operational Tradeoffs
| 9 III Dimensions of the Data Structure Design Space
| 21 IV From Workloads to Data Structures
| 67 V Adaptivity: Evolving Data Structures to a Workload
| 93 VI Data Structures for Specific Application Domains
| 109 VII Challenging Design Considerations
| 124 VIII Summary
| 132
|
| (along the way I saw some interesting figures, so, for me,
| this exercise was worth the time)
| albert_e wrote:
| Yes I was going to do the same but wanted to learn a
| general way to do this task along the way too and get more
| details (second level sections as well).
|
| ======================
|
| Step 1:
|
| Use pdftk to extract metadata
|
| > pdftk input.pdf dump_data output > raw.txt
|
| Step 2:
|
| copy relevant sections of metadata "Bookmarks" into a new
| tetx file.
|
| Step 3:
|
| Upload this text file and ask ChatGPT to format it as you
| want -- markdown / HTML / plain text.
|
| Took maybe 10 minutes and this was also worth it for a
| different reason (learning one more bit about pdftk
| capabilities).
|
| =======================
|
| CONTENTS 1. Introduction (2)
| 1.1. Data Structures Are Foundational (2) 1.2.
| Tradeoffs in Data Structure Design (4) 1.3.
| Audience & Prerequisites (7) 1.4. Learning Outcomes
| (8) 1.5. Overview of the Book (8) 2.
| Performance Metrics and Operational Tradeoffs (9)
| 2.1. Memory Hierarchy (9) 2.2. From Read/Update to
| RUM: Memory & Space Costs (10) 2.3. RUM Performance
| Costs (11) 2.4. From RUM to PyRUMID (14)
| 2.5. Chapter Summary (17) 2.6. Questions (18)
| 2.7. Further Readings (19) 3. Dimensions of
| the Data Structure Design Space (21) 3.1. Global
| Data Organization (24) 3.2. Global Search
| Algorithms When Not Using an Index (26) 3.3. Search
| When Using an Index (32) 3.4. Local Data
| Organization (46) 3.5. Local Search (48)
| 3.6. Modification Policy: In-place vs. Out-of-place (48)
| 3.7. Buffering (53) 3.8. Key-Value Representation
| (55) 3.9. Summary of the Design Space Dimensions
| (57) 3.10. Data Structure Design Expert Rules (58)
| 3.11. Chapter Summary (61) 3.12. Questions (61)
| 3.13. Further Readings (64) 4. From Workloads
| to Data Structures (67) 4.1. Point and Range
| Queries, Modifications, but Rare Scans (Traditional
| B+-trees and Learned Tree Indexes) (68) 4.2.
| Similar Workload With a Working Set That Fits in Memory
| (Fractal B+-trees) (70) 4.3. Point and Range
| Queries, Rare Scans, More Modifications (Insert-Optimized
| Search Trees) (70) 4.4. Mixed Workload With No
| Short Range Queries (Hybrid Range Trees) (75) 4.5.
| Mixed Workload, With Ever Increasing Data Size (Radix
| Trees) (78) 4.6. Point Queries, Inserts, and Some
| Modifications (Static Hashing with Overflow Pages) (79)
| 4.7. Read-mostly With Long Range Queries (Scans with
| Zonemaps) (81) 4.8. Modification-intensive With
| Point and Range Queries (LSM-tree) (82) 4.9.
| Modification-intensive With Point Queries Only (LSM-hash)
| (84) 4.10. When to Design Heterogeneous Data
| Structures (85) 4.11. Data Structures in Practice
| (86) 4.12. Chapter Summary (88) 4.13.
| Questions (89) 4.14. Further Readings (92)
| 5. Adaptivity: Evolving Data Structures to a Workload (93)
| 5.1. Design Dimension: Reorganization Aggressiveness (96)
| 5.2. Adaptivity for Frequently Accessed Data (96)
| 5.3. Adaptivity for Value-Organized Data (97) 5.4.
| Aggressiveness of Adaptivity during Initialization (100)
| 5.5. Partial Adaptive Indexing (101) 5.6. Adaptive
| Modifications (102) 5.7. Adaptivity and Concurrency
| (103) 5.8. Adaptivity Metrics (104) 5.9.
| Open Topics (105) 5.10. Chapter Summary (106)
| 5.11. Questions (107) 6. Data Structures for
| Specific Application Domains (109) 6.1. Data
| Structures in Relational Database Systems (109)
| 6.2. File Systems & Memory Management (117) 6.3.
| Data Structures in Machine Learning Pipelines (118)
| 6.4. Cross-System Design Considerations and Tradeoffs (120)
| 6.5. Chapter Summary (121) 6.6. Questions (122)
| 6.7. Further Readings (123) 7. Challenging
| Design Considerations (124) 7.1. Concurrency (125)
| 7.2. Distributed Systems (127) 7.3. Emerging
| Workload Types (127) 7.4. Hardware Considerations
| in Data Structure Implementation (128) 7.5.Chapter
| Summary (129) 7.6. Questions (129) 7.7.
| Further Readings (130) 8. Summary (132)
| Acknowledgments (134) References (135)
| csbartus wrote:
| Opening with Firefox gives a full TOC:
| https://imgur.com/a/cgdy0nY
| Ozzie_osman wrote:
| I'd love to buy a copy but it's $100 on Amazon.
| willsmith72 wrote:
| I'm still waiting for someone to disrupt and un-amazon the book
| industry. Authors lose, readers lose. It's broken
| eimrine wrote:
| Torrents can do this, also offline book shops who sell used
| books.
| petra wrote:
| One of the authors of this book runs a research lab in this
| field.
|
| They have this cool tool that can help one design an optimal data
| structure:
|
| http://daslab.seas.harvard.edu/datacalculator/
| 5Qn8mNbc2FNCiVV wrote:
| I can't seem to find where the actual tool is
| hermitcrab wrote:
| Same here.
| kinow wrote:
| Me too, some text, links, but no actual code or link to
| download.
| skavi wrote:
| If you manage to click on the correct image, you'll get to
| this page [0]. Then you can find a link to the "interactive
| demo" [1]. Weirdly well hidden.
|
| [0]: http://daslab.seas.harvard.edu/projects/evolution/
|
| [1]: http://sigmod.dyndns.org:8000/app/#
| bbsz wrote:
| More recommendations on this topic? Paper's great. I also know
| about Martin Klepmann's Designing data intensive applications but
| it's less about data structures and more about databases.
| phyrex wrote:
| I got a lot of value out of "Algorithms and Data Structures for
| Massive Datasets" (https://www.manning.com/books/algorithms-
| and-data-structures...)
| thesz wrote:
| It does not mention Array of Structures and Structure of Arrays
| dichotomy, which is extremely important if you design a structure
| to hold data for analytics of any kind.
___________________________________________________________________
(page generated 2024-02-10 23:01 UTC)