Post Alx8ihvNMg9HTuiGUy by wollman@mastodon.social
Post #Alx8ihvNMg9HTuiGUy by wollman@mastodon.social
2024-09-13T01:53:33Z
0 likes, 0 repeats
@apicultor Storage is astonishingly cheap compared to humans (unless you're renting it from Amazon).
Post #Alx8iicclsk9e2wnuS by azonenberg@ioc.exchange
2024-09-13T04:01:53Z
0 likes, 0 repeats
@wollman @apicultor How much redundancy is there in your setup?
My current Ceph cluster is all NVMe and 42TB of physical storage, but it was optimized for speed rather than capacity since I don't actually *need* that much space. So I went with 3-way replication.
Now I just need to get a 25G-capable switch, because my dual 10G pipes from the cluster nodes to the network core feel a bit light...
Post #Alx8ijN3zDtFy4ftIG by wollman@mastodon.social
2024-09-13T04:13:15Z
0 likes, 0 repeats
@azonenberg @apicultor Mostly mirrors (typically 28×2 with 4 spares), because RAID-Z2 gives us servers with far more capacity than the required performance justifies. I just revised our spec earlier this year to bump up RAM and cache capacity, and on the new architecture I've seen users fill a 25GbE link, which they couldn't on the older design.
Post #Alx8ik8v7IAgMV46t6 by wollman@mastodon.social
2024-09-13T04:15:18Z
0 likes, 0 repeats
@azonenberg @apicultor Across servers the data is theoretically unique — there might be some common datasets if the users would actually share, but most of the data is ML training intermediates, model checkpoints and evaluations.
Post #Alx8iktMKdJmgWnCGu by azonenberg@ioc.exchange
2024-09-13T04:18:15Z
0 likes, 0 repeats
@wollman @apicultor Yeah, my nodes are physically capable of, I think... nine? NVMe drives each if you load all the PCIe slots. I currently have 2x 3.84T M.2 and 1x 7.68T E1.S per node.
And all eight of the 3.5" SATA/SAS bays are unused at the moment. I just don't need that much capacity; what I want is access to the data as quickly as possible.
I'm building a new core router with a 100G NIC on it. Once that's in service hanging off the existing 10/40G core switch, I'm going to start looking at 25/100G switching options. Some of my endpoints have 25G cards in them already but are only lit up at 10G.
Post #Alx8ileVVL232kqqlE by wollman@mastodon.social
2024-09-13T04:24:17Z
0 likes, 0 repeats
@azonenberg @apicultor I built a couple of scratch servers based on 32- and 48-drive NVMe chassis. General opinion is that it's not worth the expense: client performance is dominated by network delays, and users refuse to reorganize their code in a way that optimizes for network storage. Both those servers have 40G and never get anywhere near the limit.
Post #Alx8ilerU1Jd3r18JU by wollman@mastodon.social
2024-09-13T04:17:39Z
0 likes, 0 repeats
@azonenberg @apicultor (oh, and hundreds of conda environments, which could probably be dedup'ed if dedup weren't such a performance disaster)
Post #Alx8imNWnx2pINunw0 by azonenberg@ioc.exchange
2024-09-13T04:35:12Z
0 likes, 0 repeats
@wollman @apicultor I'm writing a lot of the code that pushes a lot of bandwidth, and am always looking for opportunities to speed up the accesses.
Right now I'm seeing bottlenecks somewhere (haven't figured out where) that limit me to about 16 Gbps in linear reads via "rados bench".
I'm trying to optimize load times in ngscopeclient because I hate waiting for applications. The problem now is that load times for my typical large datasets are in the few-second range: short enough to be difficult to benchmark well, but long enough to be annoying. I've seen actual application read rates burst to 5.9 Gbps, but it flashed by too quickly to be a useful measurement (the peak was probably higher).
And of course I have to drop caches etc. between each test to make sure I get useful benchmarks.
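The "drop caches" step on Linux is a sync() followed by writing "3" to /proc/sys/vm/drop_caches. A minimal sketch (the helper name is invented; Linux-only, and the write needs root):

#include <cstdio>
#include <unistd.h>

// Flush dirty pages, then ask the kernel to drop clean page/dentry/inode
// caches so the next benchmark run starts cold. Returns false on failure
// (typically EACCES when not running as root).
bool drop_page_caches()
{
    sync();

    FILE* f = fopen("/proc/sys/vm/drop_caches", "w");
    if(!f)
        return false;

    // "3" = free page cache plus dentries and inodes
    bool ok = (fputs("3", f) >= 0);
    fclose(f);
    return ok;
}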
Post #Alx8ims0ybQmowB9SC by wollman@mastodon.social
2024-09-13T04:41:20Z
0 likes, 0 repeats
@azonenberg @apicultor Anything that allows for streaming is going to perform better than Python ML code that reads a hundred million 50KiB JPGs in random order. I've tried to convince them to just put training data in a seekable archive format, but they can't be bothered and I can't force them.
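A "seekable archive" in this sense can be as simple as one data file of concatenated records plus an (offset, length) index: training epochs stream the file linearly, while fetching a single record is still one positioned read. A hypothetical sketch of the read side, with an invented layout rather than any particular archive format:

#include <cstdint>
#include <vector>
#include <unistd.h>

// One big data file of concatenated records, plus an index of
// (offset, length) pairs loaded separately. Sequential epochs stream the
// data file linearly; random access to one record is a single pread().
struct IndexEntry
{
    uint64_t offset;
    uint64_t length;
};

std::vector<char> read_record(int fd, const IndexEntry& e)
{
    std::vector<char> buf(e.length);
    if(pread(fd, buf.data(), e.length, e.offset) != (ssize_t)e.length)
        buf.clear();    // short read or I/O error
    return buf;
}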
Post #Alx8indW7zQdCGP5Um by azonenberg@ioc.exchange
2024-09-13T04:43:36Z
0 likes, 0 repeats
@wollman @apicultor Lol. Yeah, the basic ngscopeclient file format is a top-level YAML file with session metadata, filter graph topology, etc., then a folder containing some additional metadata files and a folder for each instrument, with a subfolder for each acquisition containing one or more binary files of sample data.
The sample data file is essentially just a float[], which might contain anywhere from a few million up to about a billion elements, written out to disk as a single linear blob.
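Loading such a blob is correspondingly simple. A minimal sketch, assuming the file really is nothing but raw native-endian float32 samples written head to tail (the actual ngscopeclient loader is more involved):

#include <cstdio>
#include <vector>

// Read an entire sample-data blob into memory as float32.
// Returns an empty vector on any error.
std::vector<float> load_samples(const char* path)
{
    std::vector<float> samples;
    FILE* f = fopen(path, "rb");
    if(!f)
        return samples;

    // File size gives the element count: the blob is nothing but samples
    fseek(f, 0, SEEK_END);
    long bytes = ftell(f);
    fseek(f, 0, SEEK_SET);

    samples.resize(bytes / sizeof(float));
    size_t want = samples.size() * sizeof(float);
    if(fread(samples.data(), 1, want, f) != want)
        samples.clear();

    fclose(f);
    return samples;
}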
Post #Alx8ioJhb9AlJ68mFU by azonenberg@ioc.exchange
2024-09-13T04:45:47Z
0 likes, 0 repeats
@wollman @apicultor I'm at the point where I'm looking at things like "can I dispatch multiple reads in parallel from different threads to increase throughput further", and where the tradeoff lies between more parallelism and more overhead.
Ultimately I'd love to be able to saturate the 40Gbps pipe for one second to load a 5GB dataset into the GPU and then start crunching it.
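One common shape for that experiment is to split the blob into contiguous chunks and issue one pread() per chunk from a small pool of threads; since pread() carries its own offset, the threads don't contend on a shared file position. A sketch, where the thread count is a tuning knob rather than a recommendation:

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>
#include <unistd.h>

// Read `total` bytes from fd into dst using nthreads parallel pread() calls,
// one contiguous chunk per thread. Each thread loops because pread() may
// return short counts.
void parallel_read(int fd, char* dst, size_t total, unsigned nthreads)
{
    size_t chunk = (total + nthreads - 1) / nthreads;
    std::vector<std::thread> workers;

    for(unsigned i = 0; i < nthreads; i++)
    {
        size_t off = (size_t)i * chunk;
        if(off >= total)
            break;
        size_t len = std::min(chunk, total - off);

        workers.emplace_back([=]
        {
            size_t done = 0;
            while(done < len)
            {
                ssize_t n = pread(fd, dst + off + done, len - done, off + done);
                if(n <= 0)
                    break;    // error or unexpected EOF
                done += (size_t)n;
            }
        });
    }

    for(auto& t : workers)
        t.join();
}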
Post #Alx8ioqJdtGCwFOp5E by ignaloidas@not.acu.lt
2024-09-13T06:57:09.177Z
0 likes, 0 repeats
@azonenberg@ioc.exchange @wollman@mastodon.social @apicultor@hachyderm.io Maybe look into data compression as well? @aras@mastodon.gamedev.place looked into compressing floats last year, and I'd think for your use case using meshoptimizer could be pretty good for loading oscilloscope traces (I know it's a tool for meshes, but it has a seemingly very good float compression/decompression codec inside it too).
https://web.archive.org/web/20240521194938/https://aras-p.info/blog/2023/02/02/Float-Compression-4-Mesh-Optimizer/ (archive link, since the blog seems to be down right now :()
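The codec in question is meshoptimizer's vertex codec, which accepts any array whose element size is a multiple of 4 bytes, so each float sample can be treated as a 4-byte "vertex". A sketch along the lines of the linked post (not ngscopeclient code; ratios depend heavily on the data):

#include <vector>
#include <meshoptimizer.h>

// Lossless-codec pass only: encode samples.size() "vertices" of 4 bytes each.
// (The linked post also layers filters and quantization on top for better
// ratios.)
std::vector<unsigned char> compress_samples(const std::vector<float>& samples)
{
    size_t bound = meshopt_encodeVertexBufferBound(samples.size(), sizeof(float));
    std::vector<unsigned char> out(bound);
    size_t used = meshopt_encodeVertexBuffer(
        out.data(), out.size(), samples.data(), samples.size(), sizeof(float));
    out.resize(used);
    return out;
}

// Caller must know the original element count (store it alongside the blob)
bool decompress_samples(
    const std::vector<unsigned char>& in, std::vector<float>& samples, size_t count)
{
    samples.resize(count);
    return meshopt_decodeVertexBuffer(
        samples.data(), count, sizeof(float), in.data(), in.size()) == 0;
}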
Post #AlxA0B8A86uiwLaXy4 by aras@mastodon.gamedev.place
2024-09-13T07:08:18Z
0 likes, 0 repeats
@ignaloidas @azonenberg @apicultor @wollman Yeah, my website is a bit down right now (server moves etc., gaah). But for "how to make float[] data smaller/faster" I'd look at Blosc (https://www.blosc.org/c-blosc2/c-blosc2.html), which is usable from C and also from Python. It has many filters/compressors for float data, including lossless and lossy ones.
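For reference, the compression side of a float[] round trip through the classic c-blosc 1.x API looks roughly like this (c-blosc2 keeps these entry points under blosc1_* names; the parameter choices are starting points, not tuned values):

#include <vector>
#include <blosc.h>

// Compress a float[] with byte shuffling at typesize 4, which is the part
// that helps most on float data. blosc_init() must be called once at startup.
std::vector<char> blosc_pack(const std::vector<float>& samples)
{
    size_t nbytes = samples.size() * sizeof(float);
    std::vector<char> out(nbytes + BLOSC_MAX_OVERHEAD);

    // clevel 5, byte shuffle, 4-byte type
    int csize = blosc_compress(
        5, BLOSC_SHUFFLE, sizeof(float), nbytes,
        samples.data(), out.data(), out.size());
    out.resize(csize > 0 ? (size_t)csize : 0);    // <= 0 means failure
    return out;
}

Decompression is blosc_decompress(src, dst, dstbytes), and blosc_set_compressor("zstd") or similar picks the backing codec.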
Post #AlxA0Bfq6tqucnLRSa by ignaloidas@not.acu.lt
2024-09-13T07:11:31.811Z
0 likes, 0 repeats
@aras@mastodon.gamedev.place @azonenberg@ioc.exchange @apicultor@hachyderm.io @wollman@mastodon.social Right, but Blosc seemed a bit slower on the decompression front, and here the problem is slow data loads over the network. Meshoptimizer seems to strike a very good balance for that use case.