Post AxqI5M7hbAXIiIL0vw by bartvdbraak@mstdn.social
(DIR) Post #Axnrinvj3qevAbP01w by matrix@mastodon.matrix.org
0 likes, 0 repeats
the matrix.org homeserver is having problems: https://status.matrix.org/incidents/mm9hdm78svgv apologies for the inconvenience…
(DIR) Post #AxntE7GfJVkq4mguKO by lambda@chaosfurs.social
0 likes, 0 repeats
@matrix extra awkward when you want people to pay for it lol
(DIR) Post #Axntnn2I0qaRkkyf1U by matrix@mastodon.matrix.org
0 likes, 0 repeats
@lambda ❤️
(DIR) Post #AxnxPm7QAf6Vm7BLWK by matrix@mastodon.matrix.org
0 likes, 0 repeats
So: the matrix.org database secondary lost its FS due to a RAID failure earlier today (11:17 UTC). Then, we lost the primary at 17:26. We're trying to restore the primary DB FS (which could be fastish), while also doing a point-in-time backup restore from last night (which takes >10h). We believe the incremental DB traffic since last night is intact however. Apologies for the downtime; folks on their own homeserver are of course not impacted.
(DIR) Post #Axnz9PVwe6hUwYyegq by dminca@mastodon.social
0 likes, 0 repeats
@matrix good luck on the remediation actions 🫡 #matrixdown
(DIR) Post #Axo0eOAJoJrbgwCHrc by hisold@toot.io
0 likes, 0 repeats
@matrix This is why we need more decentralization, which happens to be the goal of Matrix.
(DIR) Post #Axo1xv57kkIjEfe14q by vincep@piaille.fr
0 likes, 0 repeats
@matrix jokes aside, RAID failures are NOT fun. Props for the quick reaction and Godspeed!
(DIR) Post #AxoBVFWNgA1TluMSWW by matrix@mastodon.matrix.org
0 likes, 0 repeats
Sorry, but it's bad news: we haven't been able to restore the DB primary filesystem to a state we're confident in running as a primary (especially given our experiences with slow-burning postgres db corruption). So we're having to do a full 55TB DB snapshot restore from last night, which will take >10h to recover the data, and then >4h to actually restore, and then >3h to catch up on missing traffic. Huge apologies for the outage. Again, folks using their own homeservers are not impacted.
(DIR) Post #AxoCGXpZQFW2ATtBUu by gwilymgj@mastodon.social
0 likes, 0 repeats
@matrix @lloydw !!! 👀
(DIR) Post #AxoDr2ofnQTEeBRCSm by Frisk@woof.tech
0 likes, 0 repeats
@matrix This sounds like a stressful 24 hours for the infrastructure operators of matrix.org. Please accept complimentary hugs :blobfoxheart:
(DIR) Post #AxoGT99OtSaJCmQf3o by crispycat@mastodon.calitabby.net
0 likes, 0 repeats
@matrix
(DIR) Post #AxoKtqcgHqWpNU71hA by mk_emkapro@mastodon.social
0 likes, 0 repeats
@matrix better to run your own; I have had very good experiences with the #conduit server https://conduit.rs
(DIR) Post #AxoM1Bmw5o7d0CyQ1w by thibaultmol@en.osm.town
0 likes, 0 repeats
@matrix best wishes for the team working on the recovery!
(DIR) Post #AxoMjLWHGzfVvQEmC8 by feedmd@mastodon.social
0 likes, 0 repeats
@matrix never heard of a hot spare?
(DIR) Post #AxoOaSpAwcSvsmy90y by scarpentier@pataterie.ca
0 likes, 0 repeats
@matrix #hugops
(DIR) Post #AxoQxARFfW0rQdjTZA by dustymabe@fosstodon.org
0 likes, 0 repeats
@matrix hugs for you all!
(DIR) Post #AxoUv0nwuqhMmrZp44 by AJCxZ0@fosstodon.org
0 likes, 0 repeats
@matrix Godspeed, admins!
(DIR) Post #AxonInLeOe9nB4pYUi by askaaron@troet.cafe
0 likes, 0 repeats
@matrix oh. that's a pity. Good luck with the repair and thanks for your work ❤️ But this is also a good reminder to use your own server. I am totally new to Matrix but started using it with my own instance (based on Synapse) from the beginning.
(DIR) Post #AxopBl1WqB4ZqjBsxc by iooioio@fosstodon.org
0 likes, 0 repeats
@matrix Much love to the team. This incident is a reminder to me of how stable the service has been so far.
(DIR) Post #Axor5vCUI4otaCf596 by alwayscurious@infosec.exchange
0 likes, 0 repeats
@matrix What was the RAID failure? Have you considered using RAID-Z with ZFS?
(DIR) Post #AxowzhaiZl4ZMWPePQ by mrclon@mastodon.ml
0 likes, 0 repeats
@matrix it's a reminder that the Matrix network is too concentrated on matrix.org. Use other homeservers, my dudes
(DIR) Post #AxoyYMCUCLcp6qAQAy by zacchiro@mastodon.xyz
0 likes, 0 repeats
@matrix as an advertisement for decentralization this is a bit harsh, but definitely effective! (J/k, of course. Good luck with the recovery and thanks!)
(DIR) Post #Axozdlmob6m1rgIwUa by codewiz@mstdn.io
0 likes, 0 repeats
@matrix Any plans to migrate away from centralized RDBMS? There are so many blob stores which can scale to petabytes and can tolerate the loss of multiple nodes without going offline.
(DIR) Post #Axp14VzTXA5oeEXk5w by matrix@mastodon.matrix.org
0 likes, 0 repeats
Status update: we’re 47TB through restoring the 55TB db snapshot of the matrix.org DB, but then have to rebuild the DB and replay the subsequent 17h of DB traffic, which will take several hours. Thank you for your patience, and apologies once again for the outage.
(DIR) Post #Axp2hNOVWoXVEdLWkK by beta3@mastodon.xyz
0 likes, 0 repeats
@matrix Thanks for the status update and all the work for getting it going again!
(DIR) Post #AxpK69k37j4bLvp99M by T_X@chaos.social
0 likes, 0 repeats
@matrix weirdly this actually feels like a positive example reinforcing the idea of a decentralized fediverse, as other instances are unaffected. Also we had been discussing running our own instance at the @chaotikumev just before the outage. I just wish there were an easy, neat account migration feature like @Mastodon has. (And I guess I can't just ex- and import chats + keys and use SRV records to have a seamless migration?)
(DIR) Post #AxpLGHCKMBHbKMRMSO by matrix@mastodon.matrix.org
0 likes, 0 repeats
Status update: we've restored the 55TB snapshot and subsequent incremental backups, and are about to replay the remaining traffic since the backup. There are still several unknowns, but if things go well the matrix.org instance should be back in 3-4 hours.
(DIR) Post #AxpOCc2dDblCg1qkN6 by mdione@en.osm.town
0 likes, 0 repeats
@matrix just an idea to improve backups: Make backups spaced like exponential backoff: last month, months 2-3 ago, months 4-6 ago, 7-12 months ago, 2-3 years ago, etc. Or with N messages instead of N days. Sounds like you could recover the fresher data first, then catch up, then restore backwards. #backup #SysAdmin
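One possible reading of the tiered scheme in the post above, as a minimal Python sketch. It assumes one backup is kept per age bucket, with bucket widths growing roughly exponentially. The bucket limits and the names `BUCKET_LIMITS_DAYS`, `bucket_index` and `select_backups` are hypothetical choices for illustration; they are not taken from the thread or from any matrix.org tooling.

```python
from datetime import date, timedelta

# Upper age bound (in days) of each bucket. Hypothetical values approximating
# "last month, months 2-3, months 4-6, months 7-12, years 2-3".
BUCKET_LIMITS_DAYS = [30, 90, 180, 365, 1095]


def bucket_index(age_days):
    """Return the index of the bucket an age falls into, or None if too old."""
    for i, limit in enumerate(BUCKET_LIMITS_DAYS):
        if age_days <= limit:
            return i
    return None


def select_backups(backup_dates, today):
    """Keep the newest backup in each age bucket; return them newest first."""
    newest = {}
    for d in backup_dates:
        age = (today - d).days
        if age < 0:
            continue  # ignore dates in the future
        idx = bucket_index(age)
        if idx is None:
            continue  # older than the last bucket: not retained
        if idx not in newest or d > newest[idx]:
            newest[idx] = d
    return sorted(newest.values(), reverse=True)


today = date(2025, 9, 3)
# One hypothetical backup every 10 days over the past 400 days.
backups = [today - timedelta(days=n) for n in range(0, 400, 10)]
kept = select_backups(backups, today)
```

Returning the retained backups newest first mirrors the suggested restore order: freshest data first, then progressively older buckets.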
(DIR) Post #AxploAvKmtdaWTJhVQ by hub@cosocial.ca
0 likes, 0 repeats
@matrix that page sends us to Bluesky...
(DIR) Post #Axpo2axWz8fXEULdPU by MazharHussain@techhub.social
0 likes, 0 repeats
@matrix oh.. So, that's why it's not working. Good luck 🤞.
(DIR) Post #AxppTHB6jqLPeiz6sy by Kalos@mstdn.social
0 likes, 0 repeats
Not long to go now... what a nuisance.
(DIR) Post #Axps3tdRQvsVGGpLZA by politipet@piaille.fr
0 likes, 0 repeats
@matrix so you're back online it seems. Thanks 👍 😘
(DIR) Post #AxpvLlCXpQpWBBSLVw by matrix@mastodon.matrix.org
0 likes, 0 repeats
Right, matrix.org is back online as of 17:00 UTC. The server is struggling a bit as it catches up. Huge apologies again for the outage; postmortem + ways to avoid a repeat will be forthcoming. See also https://www.theregister.com/2025/09/03/matrixorg_raid_failure/ & https://www.heise.de/en/news/Matrix-main-server-down-millions-of-users-affected-10630524.html. Thanks all for your patience.
(DIR) Post #Axpw2DYWnUIqHXz4C0 by kontrollierterWahnwitz@sueden.social
0 likes, 0 repeats
@matrix I’m really interested in your post mortem from a professional point of view.
(DIR) Post #AxpwjX4c6iLfGae6TY by altf4@hostux.social
0 likes, 0 repeats
@matrix welcome back !
(DIR) Post #Axpxc5JR1VjH5vah7I by vincep@piaille.fr
0 likes, 0 repeats
@matrix I should really grab that funny domain name I've been eyeing and host my own instance.
(DIR) Post #Axpy1vrWmUsfJZrXfc by CyReVolt@mastodon.social
0 likes, 0 repeats
@matrix 🥺 That must have been rough and tough. We love you! 🧡
(DIR) Post #Axq0Y9h95TpNOHV1F2 by cavallo_pazzo@toot.community
0 likes, 0 repeats
@matrix Thank you!
(DIR) Post #Axq6iRryy5oGNarLnM by thomy2000@fosstodon.org
0 likes, 0 repeats
@matrix Thanks to all the incredible people at Matrix who managed to fix this. This must have been a horrible, stressful day.
(DIR) Post #Axq7TM7o92LqAmd3C4 by amythegay@estrogen.network
0 likes, 0 repeats
@matrix
(DIR) Post #Axq7TMiJxHYg01iD6e by matrix@mastodon.matrix.org
0 likes, 0 repeats
@amythegay matrix 1: liz 0
(DIR) Post #AxqFuhnsS4RUrMrJ2W by THB_STX@infosec.exchange
0 likes, 0 repeats
@matrix On my end, I still have issues when trying to log in.
(DIR) Post #AxqGqMTxshiVWs2Npw by matrix@mastodon.matrix.org
0 likes, 0 repeats
@THB_STX we’re not aware of any issues - can you send details to support@matrix.org please?
(DIR) Post #AxqI5M7hbAXIiIL0vw by bartvdbraak@mstdn.social
0 likes, 0 repeats
@matrix props for the transparency, and my best wishes and a good night's sleep to all the engineers involved ❤️
(DIR) Post #AxrRjpfSGY0IflSdjE by penguin42@mastodon.org.uk
0 likes, 0 repeats
@matrix Well at least it wasn't the xmas holidays this time 🙂
(DIR) Post #Axs2q9Q7R42pyZADRY by AJCxZ0@fosstodon.org
0 likes, 1 repeats
Congratulations on the recovery, @matrix. While the postmortem should focus on what went wrong and how any likely recurrence of failures can be mitigated at acceptable cost, be sure to celebrate the successful recovery from catastrophic failure in production *without loss of data*, including meaningful communication to us. Many organisations with far more resources and responsibilities fail to achieve even a fraction of this.
(DIR) Post #AxvKYIfsffY8783Ojo by mr_creosote@dosgame.club
0 likes, 0 repeats
@matrix Thank you for taking this gargantuan effort of restoration! It seems the Afternet bridge is still down. Even the channel search answers with an error. Any chance this could be restored?
(DIR) Post #Axw2Sb5ZgpPfTGA9bM by matrix@mastodon.matrix.org
0 likes, 0 repeats
@mr_creosote hm, we don’t run an afternet bridge as matrix.org; it must be run by someone else who you’ll need to nudge - sorry!
(DIR) Post #AzkC7TukzHtD7y32EC by AJCxZ0@fosstodon.org
0 likes, 0 repeats
@matrix Thank you for sharing this illuminating post-mortem[1] of an unlikely and unfortunate combination of hardware and human errors, handling these very well in a real environment in which little outstanding matters become much less little, including a very practical list of lessons learned. Good job. Concerning the lessons, I must caution you strongly against one idea: never alias basic commands such as "rm". Wrapping using a new name, e.g. "del", is fine. [1] https://matrix.org/blog/2025/10/post-mortem/