Post Alx8ihvNMg9HTuiGUy by wollman@mastodon.social
Post #Alx8ihvNMg9HTuiGUy by wollman@mastodon.social
2024-09-13T01:53:33Z
0 likes, 0 repeats
@apicultor Storage is astonishingly cheap compared to humans (unless you're renting it from Amazon).
Post #Alx8iicclsk9e2wnuS by azonenberg@ioc.exchange
2024-09-13T04:01:53Z
0 likes, 0 repeats
@wollman @apicultor How much redundancy is there in your setup?
My current Ceph cluster is all NVMe and 42TB of physical storage, but it was optimized for speed rather than capacity since I don't actually *need* that much space. So I went with 3-way replication.
Now I just need to get a 25G-capable switch, because my dual 10G pipes from the cluster nodes to the network core feel a bit light...
Post #Alx8ijN3zDtFy4ftIG by wollman@mastodon.social
2024-09-13T04:13:15Z
0 likes, 0 repeats
@azonenberg @apicultor Mostly mirrors (typically 28×2 with 4 spares), because RAID-Z2 gives us servers with far more capacity than the required performance justifies. I just revised our spec earlier this year to bump up RAM and cache capacity, and on the new architecture I've seen users fill a 25GbE link, which they couldn't on the older design.
Post #Alx8ik8v7IAgMV46t6 by wollman@mastodon.social
2024-09-13T04:15:18Z
0 likes, 0 repeats
@azonenberg @apicultor Across servers the data is theoretically unique — there might be some common datasets if the users would actually share, but most of the data is ML training intermediates, model checkpoints and evaluations.
Post #Alx8iktMKdJmgWnCGu by azonenberg@ioc.exchange
2024-09-13T04:18:15Z
0 likes, 0 repeats
@wollman @apicultor Yeah, my nodes are physically capable of, I think... nine? NVMe drives each if you load all the PCIe slots. I currently have 2x 3.84T M.2 and 1x 7.68T E1.S per node.
And all eight of the 3.5" SATA/SAS bays are unused at the moment. I just don't need that much capacity; what I want is access to the data as quickly as possible.
I'm building a new core router with a 100G NIC on it. Once that's in service hanging off the existing 10/40G core switch, I'm going to start looking at 25/100G switching options. Some of my endpoints have 25G cards in them already but are only lit up at 10G.
Post #Alx8ileVVL232kqqlE by wollman@mastodon.social
2024-09-13T04:24:17Z
0 likes, 0 repeats
@azonenberg @apicultor I built a couple of scratch servers based on 32- and 48-drive NVMe chassis. General opinion is that it's not worth the expense: client performance is dominated by network delays, and users refuse to reorganize their code in a way that optimizes for network storage. Both those servers have 40G and never get anywhere near the limit.
Post #Alx8ilerU1Jd3r18JU by wollman@mastodon.social
2024-09-13T04:17:39Z
0 likes, 0 repeats
@azonenberg @apicultor (oh, and hundreds of conda environments, which could probably be dedup'ed if dedup weren't such a performance disaster)
Post #Alx8imNWnx2pINunw0 by azonenberg@ioc.exchange
2024-09-13T04:35:12Z
0 likes, 0 repeats
@wollman @apicultor I'm writing a lot of the code that pushes a lot of bandwidth, and am always looking for opportunities to speed up the accesses.
Right now I'm seeing bottlenecks somewhere (haven't figured out where) that limit me to about 16 Gbps in linear reads via "rados bench".
I'm trying to optimize load times in ngscopeclient because I hate waiting for applications. The problem now is that load times for my typical large datasets are in the few-second range: short enough to be difficult to benchmark well, but long enough to be annoying. I've seen actual application read rates burst to 5.9 Gbps, but it flashed by too quickly to be a useful measurement (the peak was probably higher).
And of course I have to drop caches etc. between each test to make sure I get useful benchmarks.
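The "drop caches" step on Linux is a sync() followed by writing "3" to /proc/sys/vm/drop_caches. A minimal sketch (the helper name is invented; Linux-only, and the write needs root):

#include <cstdio>
#include <unistd.h>

// Flush dirty pages, then ask the kernel to drop clean page/dentry/inode
// caches so the next benchmark run starts cold. Returns false on failure
// (typically EACCES when not running as root).
bool drop_page_caches()
{
    sync();

    FILE* f = fopen("/proc/sys/vm/drop_caches", "w");
    if(!f)
        return false;

    // "3" = free page cache plus dentries and inodes
    bool ok = (fputs("3", f) >= 0);
    fclose(f);
    return ok;
}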
Post #Alx8ims0ybQmowB9SC by wollman@mastodon.social
2024-09-13T04:41:20Z
0 likes, 0 repeats
@azonenberg @apicultor Anything that allows for streaming is going to perform better than Python ML code that reads a hundred million 50KiB JPGs in random order. I've tried to convince them to just put training data in a seekable archive format, but they can't be bothered and I can't force them.
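A "seekable archive" in this sense can be as simple as one data file of concatenated records plus an (offset, length) index: training epochs stream the file linearly, while fetching a single record is still one positioned read. A hypothetical sketch of the read side, with an invented layout rather than any particular archive format:

#include <cstdint>
#include <vector>
#include <unistd.h>

// One big data file of concatenated records, plus an index of
// (offset, length) pairs loaded separately. Sequential epochs stream the
// data file linearly; random access to one record is a single pread().
struct IndexEntry
{
    uint64_t offset;
    uint64_t length;
};

std::vector<char> read_record(int fd, const IndexEntry& e)
{
    std::vector<char> buf(e.length);
    if(pread(fd, buf.data(), e.length, e.offset) != (ssize_t)e.length)
        buf.clear();    // short read or I/O error
    return buf;
}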
Post #Alx8indW7zQdCGP5Um by azonenberg@ioc.exchange
2024-09-13T04:43:36Z
0 likes, 0 repeats
@wollman @apicultor Lol. Yeah, the basic ngscopeclient file format is a top-level YAML file with session metadata, filter graph topology, etc., then a folder containing some additional metadata files and a folder for each instrument, with a subfolder for each acquisition containing one or more binary files of sample data.
The sample data file is essentially just a float[], which might contain anywhere from a few million up to about a billion elements, written out to disk as a single linear blob.
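Loading such a blob is correspondingly simple. A minimal sketch, assuming the file really is nothing but raw native-endian float32 samples written head to tail (the actual ngscopeclient loader is more involved):

#include <cstdio>
#include <vector>

// Read an entire sample-data blob into memory as float32.
// Returns an empty vector on any error.
std::vector<float> load_samples(const char* path)
{
    std::vector<float> samples;
    FILE* f = fopen(path, "rb");
    if(!f)
        return samples;

    // File size gives the element count: the blob is nothing but samples
    fseek(f, 0, SEEK_END);
    long bytes = ftell(f);
    fseek(f, 0, SEEK_SET);

    samples.resize(bytes / sizeof(float));
    size_t want = samples.size() * sizeof(float);
    if(fread(samples.data(), 1, want, f) != want)
        samples.clear();

    fclose(f);
    return samples;
}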
Post #Alx8ioJhb9AlJ68mFU by azonenberg@ioc.exchange
2024-09-13T04:45:47Z
0 likes, 0 repeats
@wollman @apicultor I'm at the point where I'm looking at things like "can I dispatch multiple reads in parallel from different threads to increase throughput further", and where the tradeoff lies between more parallelism and more overhead.
Ultimately I'd love to be able to saturate the 40Gbps pipe for one second to load a 5GB dataset into the GPU and then start crunching it.
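One common shape for that experiment is to split the blob into contiguous chunks and issue one pread() per chunk from a small pool of threads; since pread() carries its own offset, the threads don't contend on a shared file position. A sketch, where the thread count is a tuning knob rather than a recommendation:

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>
#include <unistd.h>

// Read `total` bytes from fd into dst using nthreads parallel pread() calls,
// one contiguous chunk per thread. Each thread loops because pread() may
// return short counts.
void parallel_read(int fd, char* dst, size_t total, unsigned nthreads)
{
    size_t chunk = (total + nthreads - 1) / nthreads;
    std::vector<std::thread> workers;

    for(unsigned i = 0; i < nthreads; i++)
    {
        size_t off = (size_t)i * chunk;
        if(off >= total)
            break;
        size_t len = std::min(chunk, total - off);

        workers.emplace_back([=]
        {
            size_t done = 0;
            while(done < len)
            {
                ssize_t n = pread(fd, dst + off + done, len - done, off + done);
                if(n <= 0)
                    break;    // error or unexpected EOF
                done += (size_t)n;
            }
        });
    }

    for(auto& t : workers)
        t.join();
}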
Post #Alx8ioqJdtGCwFOp5E by ignaloidas@not.acu.lt
2024-09-13T06:57:09.177Z
0 likes, 0 repeats
@azonenberg@ioc.exchange @wollman@mastodon.social @apicultor@hachyderm.io Maybe look into data compression as well? @aras@mastodon.gamedev.place looked into compressing floats last year, and I'd think for your use case using meshoptimizer could be pretty good for loading oscilloscope traces (I know it's a tool for meshes, but it has a seemingly very good float compression/decompression codec inside it too).
https://web.archive.org/web/20240521194938/https://aras-p.info/blog/2023/02/02/Float-Compression-4-Mesh-Optimizer/ (archive link, since the blog seems to be down right now :()
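The codec in question is meshoptimizer's vertex codec, which accepts any array whose element size is a multiple of 4 bytes, so each float sample can be treated as a 4-byte "vertex". A sketch along the lines of the linked post (not ngscopeclient code; ratios depend heavily on the data):

#include <vector>
#include <meshoptimizer.h>

// Lossless-codec pass only: encode samples.size() "vertices" of 4 bytes each.
// (The linked post also layers filters and quantization on top for better
// ratios.)
std::vector<unsigned char> compress_samples(const std::vector<float>& samples)
{
    size_t bound = meshopt_encodeVertexBufferBound(samples.size(), sizeof(float));
    std::vector<unsigned char> out(bound);
    size_t used = meshopt_encodeVertexBuffer(
        out.data(), out.size(), samples.data(), samples.size(), sizeof(float));
    out.resize(used);
    return out;
}

// Caller must know the original element count (store it alongside the blob)
bool decompress_samples(
    const std::vector<unsigned char>& in, std::vector<float>& samples, size_t count)
{
    samples.resize(count);
    return meshopt_decodeVertexBuffer(
        samples.data(), count, sizeof(float), in.data(), in.size()) == 0;
}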
Post #AlxA0B8A86uiwLaXy4 by aras@mastodon.gamedev.place
2024-09-13T07:08:18Z
0 likes, 0 repeats
@ignaloidas @azonenberg @apicultor @wollman Yeah, my website is a bit down right now (server moves etc., gaah). But for "how to make float[] data smaller/faster" I'd look at Blosc (https://www.blosc.org/c-blosc2/c-blosc2.html), which is usable from C and also from Python. It has many filters/compressors for float data, including lossless and lossy ones.
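For reference, the compression side of a float[] round trip through the classic c-blosc 1.x API looks roughly like this (c-blosc2 keeps these entry points under blosc1_* names; the parameter choices are starting points, not tuned values):

#include <vector>
#include <blosc.h>

// Compress a float[] with byte shuffling at typesize 4, which is the part
// that helps most on float data. blosc_init() must be called once at startup.
std::vector<char> blosc_pack(const std::vector<float>& samples)
{
    size_t nbytes = samples.size() * sizeof(float);
    std::vector<char> out(nbytes + BLOSC_MAX_OVERHEAD);

    // clevel 5, byte shuffle, 4-byte type
    int csize = blosc_compress(
        5, BLOSC_SHUFFLE, sizeof(float), nbytes,
        samples.data(), out.data(), out.size());
    out.resize(csize > 0 ? (size_t)csize : 0);    // <= 0 means failure
    return out;
}

Decompression is blosc_decompress(src, dst, dstbytes), and blosc_set_compressor("zstd") or similar picks the backing codec.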
Post #AlxA0Bfq6tqucnLRSa by ignaloidas@not.acu.lt
2024-09-13T07:11:31.811Z
0 likes, 0 repeats
@aras@mastodon.gamedev.place @azonenberg@ioc.exchange @apicultor@hachyderm.io @wollman@mastodon.social Right, but Blosc seemed a bit slower on the decompression front, and here the problem is slow data loads over the network. Meshoptimizer seems to strike a very good balance for that use case.