[HN Gopher] Data Structures for Data-Intensive Applications [pdf...
       ___________________________________________________________________
        
       Data Structures for Data-Intensive Applications [pdf] (2023)
        
       Author : mfiguiere
       Score  : 496 points
       Date   : 2024-02-09 18:53 UTC (1 days ago)
        
 (HTM) web link (cs-people.bu.edu)
 (TXT) w3m dump (cs-people.bu.edu)
        
       | Octokat wrote:
       | Thank you for sharing this!
        
       | datadrivenangel wrote:
       | Data structures across distributed systems are challenging, so
       | this is a useful resource.
       | 
       | Still, better if you can keep your application on a single
       | machine if that's feasible.
        
         | teraflop wrote:
         | This paper does not cover distributed systems, except for a
         | very brief mention at the end.
        
           | fithisux wrote:
           | https://arxiv.org/abs/2001.04235
           | 
           | here you are
        
       | nwareing wrote:
       | DSDIA ;)
        
       | pgwhalen wrote:
       | I've only had a chance to skim so far, but this is an absolutely
       | superb survey of an enormous space. It doesn't just list out a
       | bunch of data structures, it actully helps you build a mental
       | model of all the concerns to think about when building or using
       | data structures in an application.
        
         | 3abiton wrote:
         | That's the recap I needed!
        
         | PH95VuimJjqBqy wrote:
         | Easily one of the best technical books I've ever read.
        
       | lgkk wrote:
       | This is a good read
        
       | russfink wrote:
       | Needs a TOC
        
         | albert_e wrote:
         | I uploaded PDF and asked ChatGPT 4 for ToC ..it is really
         | struggling to process this.
         | 
         | Even after asking it to ignore page headers and footers.
         | 
         | I thought the state of the art was much better?
        
           | 082349872349872 wrote:
           | By hand (5 min):                 I Introduction
           | 2       II Performance Metrics and Operational Tradeoffs
           | 9       III Dimensions of the Data Structure Design Space
           | 21       IV From Workloads to Data Structures
           | 67       V Adaptivity: Evolving Data Structures to a Workload
           | 93       VI Data Structures for Specific Application Domains
           | 109       VII Challenging Design Considerations
           | 124       VIII Summary
           | 132
           | 
           | (along the way I saw some interesting figures, so, for me,
           | this exercise was worth the time)
        
             | albert_e wrote:
             | Yes I was going to do the same but wanted to learn a
             | general way to do this task along the way too and get more
             | details (second level sections as well).
             | 
             | ======================
             | 
             | Step 1:
             | 
             | Use pdftk to extract metadata
             | 
             | > pdftk input.pdf dump_data output > raw.txt
             | 
             | Step 2:
             | 
             | copy relevant sections of metadata "Bookmarks" into a new
             | tetx file.
             | 
             | Step 3:
             | 
             | Upload this text file and ask ChatGPT to format it as you
             | want -- markdown / HTML / plain text.
             | 
             | Took maybe 10 minutes and this was also worth it for a
             | different reason (learning one more bit about pdftk
             | capabilities).
             | 
             | =======================
             | 
             | CONTENTS                   1. Introduction (2)
             | 1.1. Data Structures Are Foundational (2)         1.2.
             | Tradeoffs in Data Structure Design (4)         1.3.
             | Audience & Prerequisites (7)         1.4. Learning Outcomes
             | (8)         1.5. Overview of the Book (8)              2.
             | Performance Metrics and Operational Tradeoffs (9)
             | 2.1. Memory Hierarchy (9)         2.2. From Read/Update to
             | RUM: Memory & Space Costs (10)         2.3. RUM Performance
             | Costs (11)         2.4. From RUM to PyRUMID (14)
             | 2.5. Chapter Summary (17)         2.6. Questions (18)
             | 2.7. Further Readings (19)              3. Dimensions of
             | the Data Structure Design Space (21)         3.1. Global
             | Data Organization (24)         3.2. Global Search
             | Algorithms When Not Using an Index (26)         3.3. Search
             | When Using an Index (32)         3.4. Local Data
             | Organization (46)         3.5. Local Search (48)
             | 3.6. Modification Policy: In-place vs. Out-of-place (48)
             | 3.7. Buffering (53)         3.8. Key-Value Representation
             | (55)         3.9. Summary of the Design Space Dimensions
             | (57)         3.10. Data Structure Design Expert Rules (58)
             | 3.11. Chapter Summary (61)         3.12. Questions (61)
             | 3.13. Further Readings (64)              4. From Workloads
             | to Data Structures (67)         4.1. Point and Range
             | Queries, Modifications, but Rare Scans (Traditional
             | B+-trees and Learned Tree Indexes) (68)         4.2.
             | Similar Workload With a Working Set That Fits in Memory
             | (Fractal B+-trees) (70)         4.3. Point and Range
             | Queries, Rare Scans, More Modifications (Insert-Optimized
             | Search Trees) (70)         4.4. Mixed Workload With No
             | Short Range Queries (Hybrid Range Trees) (75)         4.5.
             | Mixed Workload, With Ever Increasing Data Size (Radix
             | Trees) (78)         4.6. Point Queries, Inserts, and Some
             | Modifications (Static Hashing with Overflow Pages) (79)
             | 4.7. Read-mostly With Long Range Queries (Scans with
             | Zonemaps) (81)         4.8. Modification-intensive With
             | Point and Range Queries (LSM-tree) (82)         4.9.
             | Modification-intensive With Point Queries Only (LSM-hash)
             | (84)         4.10. When to Design Heterogeneous Data
             | Structures (85)         4.11. Data Structures in Practice
             | (86)         4.12. Chapter Summary (88)         4.13.
             | Questions (89)         4.14. Further Readings (92)
             | 5. Adaptivity: Evolving Data Structures to a Workload (93)
             | 5.1. Design Dimension: Reorganization Aggressiveness (96)
             | 5.2. Adaptivity for Frequently Accessed Data (96)
             | 5.3. Adaptivity for Value-Organized Data (97)         5.4.
             | Aggressiveness of Adaptivity during Initialization (100)
             | 5.5. Partial Adaptive Indexing (101)         5.6. Adaptive
             | Modifications (102)         5.7. Adaptivity and Concurrency
             | (103)         5.8. Adaptivity Metrics (104)         5.9.
             | Open Topics (105)         5.10. Chapter Summary (106)
             | 5.11. Questions (107)              6. Data Structures for
             | Specific Application Domains (109)         6.1. Data
             | Structures in Relational Database Systems (109)
             | 6.2. File Systems & Memory Management (117)         6.3.
             | Data Structures in Machine Learning Pipelines (118)
             | 6.4. Cross-System Design Considerations and Tradeoffs (120)
             | 6.5. Chapter Summary (121)         6.6. Questions (122)
             | 6.7. Further Readings (123)              7. Challenging
             | Design Considerations (124)         7.1. Concurrency (125)
             | 7.2. Distributed Systems (127)         7.3. Emerging
             | Workload Types (127)         7.4. Hardware Considerations
             | in Data Structure Implementation (128)         7.5.Chapter
             | Summary (129)         7.6. Questions (129)         7.7.
             | Further Readings (130)              8. Summary (132)
             | Acknowledgments (134)         References (135)
        
         | csbartus wrote:
         | Opening with Firefox gives a full TOC:
         | https://imgur.com/a/cgdy0nY
        
       | Ozzie_osman wrote:
       | I'd love to buy a copy but it's $100 on Amazon.
        
         | willsmith72 wrote:
         | I'm still waiting for someone to disrupt and un-amazon the book
         | industry. Authors lose, readers lose. It's broken
        
           | eimrine wrote:
           | Torrents can do this, also offline book shops who sell used
           | books.
        
       | petra wrote:
       | One of the authors of this book runs a research lab in this
       | field.
       | 
       | They have this cool tool that can help one design an optimal data
       | structure:
       | 
       | http://daslab.seas.harvard.edu/datacalculator/
        
         | 5Qn8mNbc2FNCiVV wrote:
         | I can't seem to find where the actual tool is
        
           | hermitcrab wrote:
           | Same here.
        
           | kinow wrote:
           | Me too, some text, links, but no actual code or link to
           | download.
        
           | skavi wrote:
           | If you manage to click on the correct image, you'll get to
           | this page [0]. Then you can find a link to the "interactive
           | demo" [1]. Weirdly well hidden.
           | 
           | [0]: http://daslab.seas.harvard.edu/projects/evolution/
           | 
           | [1]: http://sigmod.dyndns.org:8000/app/#
        
       | bbsz wrote:
       | More recommendations on this topic? Paper's great. I also know
       | about Martin Klepmann's Designing data intensive applications but
       | it's less about data structures and more about databases.
        
         | phyrex wrote:
         | I got a lot of value out of "Algorithms and Data Structures for
         | Massive Datasets" (https://www.manning.com/books/algorithms-
         | and-data-structures...)
        
       | thesz wrote:
       | It does not mention Array of Structures and Structure of Arrays
       | dichotomy, which is extremely important if you design a structure
       | to hold data for analytics of any kind.
        
       ___________________________________________________________________
       (page generated 2024-02-10 23:01 UTC)