[HN Gopher] PCIe trouble with 4TB Crucial T500 NVMe SSD for >1 p...
       ___________________________________________________________________
        
       PCIe trouble with 4TB Crucial T500 NVMe SSD for >1 power cycle on
       MSI PRO X670-P
        
       Author : transpute
       Score  : 159 points
       Date   : 2024-12-28 03:04 UTC (1 days ago)
        
 (HTM) web link (forum.level1techs.com)
 (TXT) w3m dump (forum.level1techs.com)
        
       | tfwnopmt wrote:
       | HDMI provides power - that's how old chromecasts can work without
       | a separate power plug.
       | 
       | The comment about NPNs and PNPs is garbage, but there is a design
       | fault with the board - it shouldn't allow HDMI power to flow
       | backwards into the motherboard when the motherboard shuts off.
       | That would likely cause a power rail sequencing issue on the
       | board or SSD, leading to latch-up of various ICs and
       | non-detection of the SSD on the following bootup.
        
         | LeifCarrotson wrote:
         | And by "the board" I trust you mean the MSI PRO X670-P WIFI
         | motherboard.
         | 
         | There's nothing incorrect about the behavior of the SSD when
         | it's being operated outside the prescribed voltage and power
         | thresholds.
         | 
         | If there's a trickle (and to be clear, the 5V at 300 mA
         | available from an HDMI cable is a trickle for a full
         | motherboard) of current into the 3V3 bus on the ATX connector,
         | _something_ will be the very lowest PMIC to turn on. It's just
         | that on this system, the SSD was the first thing. If anything,
         | the SSD will probably be highly tolerant of brownouts because
         | its LDO will run at around 1.9V.
        
           | hulitu wrote:
           | > There's nothing incorrect about the behavior of the SSD
           | when it's being operated outside the prescribed voltage and
           | power thresholds.
           | 
           | It shall set itself in Reset state.
        
             | LeifCarrotson wrote:
             | That would be nice, but in practice the SSD requires its power
             | rails to start up in a particular sequence and with very
             | particular voltages.
        
             | shadowpho wrote:
             | Only a few devices are actually able to do that. The vast
             | majority require proper voltage sequencing, because to do
             | otherwise is to add cost to your IC.
        
           | Dylan16807 wrote:
           | > There's nothing incorrect about the behavior of the SSD
           | when it's being operated outside the prescribed voltage and
           | power thresholds.
           | 
           | I'd put some more emphasis on "when", though. If it never
           | comes back when power comes back that's not particularly
           | correct.
        
             | crest wrote:
             | That's because if this theory is correct from the point of
             | view of the SSD there was no reboot yet, because there was
             | never any total power loss.
        
               | Dylan16807 wrote:
               | It handles warm reboots without power loss just fine, so
               | it deciding now it needs to wait for power loss seems
               | like a flaw.
        
               | wtallis wrote:
               | If the SSD reacts to the start of a brown-out with supply
               | voltage dropping way below spec as a signal that an
               | unplanned power loss is happening, then it may do an
               | emergency flush and shutdown that leaves it simply
               | waiting for power to finish dropping to zero. It makes at
               | least some sense for the drive to not try to wake up from
               | that state without a clean power cycle.
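
A minimal sketch of the latching behavior hypothesized above, assuming a drive that treats any dip below its operating voltage as unplanned power loss and then refuses to restart until the rail has fully collapsed. The state names and both voltage thresholds are made up for illustration; they do not come from any datasheet or from the T500's firmware.

```python
from enum import Enum, auto


class DriveState(Enum):
    OFF = auto()               # rail fully collapsed
    RUNNING = auto()           # normal operation
    SHUTDOWN_LATCHED = auto()  # emergency flush done, waiting for power to die


# Hypothetical thresholds, for illustration only.
V_MIN_OPERATING = 3.0  # volts: below this the drive assumes power is failing
V_POWER_OFF = 0.5      # volts: below this the rail counts as fully off


def step(state, rail_volts):
    """Return the next drive state for one sample of the 3.3 V rail."""
    if rail_volts < V_POWER_OFF:
        return DriveState.OFF
    if state is DriveState.OFF:
        # Only a start from a dead rail counts as a clean power-up.
        if rail_volts >= V_MIN_OPERATING:
            return DriveState.RUNNING
        return DriveState.OFF
    if state is DriveState.RUNNING and rail_volts < V_MIN_OPERATING:
        # Brown-out: flush and latch until the rail really reaches zero.
        return DriveState.SHUTDOWN_LATCHED
    return state


def simulate(samples, state=DriveState.OFF):
    """Feed a sequence of rail voltages through the state machine."""
    for volts in samples:
        state = step(state, volts)
    return state


if __name__ == "__main__":
    # Rail held up at 1.9 V by back-feed, then restored: the drive stays latched.
    print(simulate([3.3, 1.9, 3.3]))
    # Rail allowed to reach zero before power returns: the drive restarts.
    print(simulate([3.3, 1.9, 0.0, 3.3]))
```

In this toy model a warm reboot never leaves RUNNING because the rail never dips, which is consistent with warm reboots working while a back-fed shutdown does not.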
        
               | Dylan16807 wrote:
               | I think "makes at least some sense" and "not particularly
               | correct" can be true at the same time.
        
               | smileybarry wrote:
               | It should still handle PCIe probing and (logical)
               | reconnection without a reboot, though, e.g.: PCIe
               | redirection for a VM.
        
         | magic_smoke_ee wrote:
         | The reality is retail PC electronics, like much consumer
         | electronics with short lifespans, are designed/engineered and
         | manufactured more-or-less like disposable e-waste garbage.
         | Eevblog Dave or Bigclive might be able to get to the bottom of
         | the circuit or manufacturing design error, albeit with some
         | help if it turns out to be a digital-or-up-the-stack issue.
        
           | KeplerBoy wrote:
           | meh, I rarely have electronics fail these days. Whatever
           | corners designers are cutting seem perfectly adequate to be
           | cut to make stuff affordable.
        
             | lazide wrote:
             | The rise of mass-produced cheap ICs with somewhat reasonable
             | behavior is the cause. It's cheaper to add some logic to
             | something when you're making a million or more of them than
             | when it's an additional couple of discrete components and an
             | additional circuit you need to add yourself.
        
         | gbil wrote:
         | >HDMI provides power - that's how old chromecasts can work
         | without a separate power plug.
         | 
         | I still have the first Chromecast released, it doesn't operate
         | without external power plugged in so I'm not sure about the
         | validity of your comment, at least for the chromecast part
        
           | bradfitz wrote:
           | https://www.hdmi.org/spec21sub/cablepower
        
             | rzzzt wrote:
             | Connection is the same as attaching an ordinary, "wired"
             | HDMI Cable, except that active cables can only be attached
             | in one direction: One end of the cable is specifically
             | labeled for attachment to the HDMI Source (transmitting)
             | device, and the other end of the cable must be attached to
             | the HDMI Sink (receiving) device. If the cable is attached
             | in reverse, no damage will occur, but the connection will
             | not work.
             | 
             | HDMI Cables with HDMI Cable Power include a separate power
             | connector for use with source devices that do not support
             | the HDMI Cable Power feature.
             | 
             | This is not your run-of-the-mill HDMI cable for sure.
        
             | numpad0 wrote:
             | No, not that feature. HDMI supported 5V/55mA power out for
             | years. It's meant for EDID ROM chips and maybe HDMI
             | selectors too, not Linux based computers, but some TVs
             | could take it in gross violation of specifications and its
             | spirits.
        
           | nosrepa wrote:
           | And the serial number of that power plug is MST3K-US
        
           | kuschku wrote:
           | The first chromecast actually operated without external
           | power, but it only worked with some TVs.
           | 
           | It's possible yours didn't provide enough power via HDMI, but
           | at least ours worked just fine.
        
             | ssl-3 wrote:
             | It is possible that your memory of a device from a decade
             | ago is faulty. No Chromecast has ever been able to be
             | powered by HDMI alone. That has never been a thing.
             | 
             | You may instead be remembering the fact only some TVs back
             | then were successful at powering the Chromecast without an
             | external power brick, using a USB port on the TV itself to
             | power up the Chromecast.
             | 
             | In applications where this worked (and it often did work,
             | although it also often did not work), it could provide a
             | solution that existed entirely on the back of the TV with
             | nothing additional plugged into the wall.
             | 
             | But it was still [micro] USB that provided the power to the
             | OG streaming stick, not HDMI.
        
               | kuschku wrote:
               | > It is possible that your memory of a device from a
               | decade ago is faulty. No Chromecast has ever been able to
               | be powered by HDMI alone. That has never been a thing.
               | 
               | It is not - I still use my 11yo Chromecast Gen1 today.
               | And it still works fine without USB power (as long as you
               | don't try to play YouTube videos).
        
               | altcognito wrote:
               | I also had this device and would concur it was supposed
               | to work without USB power, but in my experience worked
               | extremely poorly.
        
               | lightedman wrote:
               | "You may instead by remembering the fact only some TVs
               | back then were successful at powering the Chromecast
               | without an external power brick, using a USB port on the
               | TV itself to power up the Chromecast."
               | 
               | I'm looking at my first gen plugged into the ARC HDMI
               | port on my Vizio TV. It is ONLY attached to the HDMI port
               | and nothing else.
        
               | 486sx33 wrote:
               | +1, my Vizio powers this as well. It also powers lots of
               | stuff via USB.
               | 
               | Maybe because it's NOT a smart tv and doesn't have some
               | crazy android chip SoC to constantly power. I mean
               | obviously you can make a power supply that could do both
               | - or neither. But it likely comes down to price for the
               | manufacturer of the tv
        
             | smileybarry wrote:
             | Right, but I think it wasn't a real intended use case and
             | that some TVs provided amperage over the spec (maybe by
             | accident? simpler circuit bridging the same power pin for
             | USB and HDMI?).
             | 
             | I had the same first gen Chromecast (may even have it lying
             | around somewhere) but it came with explicit directions to
             | use the included power cable, so maybe they updated the
             | included guide some time after release.
        
               | photon_rancher wrote:
               | They probably just provide extra power over the port. It
               | costs extra to design an extra supply for a specific port
               | so it's probably shared, and likewise also costs extra to
               | current limit each port. So more than likely a cost
               | saving measure
        
         | ssl-3 wrote:
         | HDMI does provide power, but this is not how Chromecast (or
         | similar) devices have ever been powered.
         | 
         | It supplies 5v at up to 50mA from a sink device like a TV.
         | 
         | That's only a quarter of a Watt, which is perhaps enough for
         | something like an EDID ROM, or maybe a switch or perhaps an
         | extender. It is not enough power to run a Chromecast.
         | 
         | HDMI 2.1b Amendment 1 [0] can supply up to 300mA at 5v, but
         | that specification is only a year or so old. It requires a
         | special cable. And 1.5 Watts maximum isn't enough to run a
         | Chromecast, either. (The intent is to be able to use it to run
         | a somewhat thirstier extender than the earlier specifications
         | would permit.)
         | 
         | 0: https://www.hdmi.org/spec21sub/cablepower
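
The wattage figures in the comment above follow directly from P = V x I. A small sketch of that arithmetic, using only the 5 V, 50 mA and 300 mA numbers quoted in the comment; the helper names are made up for illustration, and any load wattage passed to `can_supply` is whatever the reader chooses to plug in.

```python
def watts(volts, milliamps):
    """Electrical power in watts from a voltage and a current in mA."""
    return volts * milliamps / 1000.0


# Figures quoted in the comment above.
HDMI_VOLTS = 5.0
BASELINE_MA = 50       # the long-standing +5 V pin
CABLE_POWER_MA = 300   # HDMI 2.1b Amendment 1 "Cable Power"


def can_supply(budget_watts, load_watts):
    """True if a load fits within the available power budget."""
    return load_watts <= budget_watts


if __name__ == "__main__":
    for label, ma in (("baseline", BASELINE_MA), ("cable power", CABLE_POWER_MA)):
        print(f"{label}: {watts(HDMI_VOLTS, ma):.2f} W")
```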
        
           | kalleboo wrote:
           | > _It supplies 5v at up to 50mA from a sink device like a
           | TV._
           | 
           | And USB is also only supposed to supply 100 mA until the
           | device negotiates for more.
           | 
           | But literally every device in the real-world just wires the
           | port to the 5V rail with 2 A overcurrent protection and your
           | "dumb" USB-powered fan gadget can draw as much as it wants
           | without any negotiation.
           | 
           | I can totally see TVs doing the same
        
             | mschuster91 wrote:
             | > But literally every device in the real-world just wires
             | the port to the 5V rail with 2 A overcurrent protection
             | 
             | Except Macs, Macbooks, iMacs, I _think_ also at least the
             | Thunderbolt Display from  <very many years ago>. They all
             | have a software overcurrent protection that is _very_
             | triggerhappy. No negotiation and it will whine and shut the
             | offending device off, and same if the negotiated current
             | draw is exceeded.
             | 
             | Might be worth a try somewhen when I'm rich enough to
             | afford a dynamic resistor bank to verify all the
             | characteristics...
        
               | userbinator wrote:
               | I've looked at Macbook (pre M1) schematics; they do the
               | same as any other PC laptop. The USB power switches do
               | not have adjustable current limits.
        
               | kalleboo wrote:
               | I've never had any issues running dumb USB loads off any
               | of my MacBooks. Just tested it, no problem running 1.7 A
               | of dumb resistors just soldered to the power pins with
               | nothing on the data pins at all (not even the passive
               | "apple charging" resistors)
               | https://kalleboo.com/linked/usb-dummy-load.jpg
               | 
               | Macs _will_ shut down a port if it goes over 2.4 A (IIRC)
               | without USB-PD negotiation (mainly with the cable rather
               | than the device).
               | 
               | But the USB standard says they should limit to 100 mA
               | without USB 1.x negotiation, and it's not doing that.
        
             | indrora wrote:
             | > But literally every device in the real-world just wires
             | the port to the 5V rail with 2 A overcurrent protection
             | 
             | Not quite. To be USB Compliant, you have to do some work
             | here and there. There's about six different options. The
             | most common _is_ overcurrent detection, such as is seen in
             | [1]. There is a whole specification built by USB-IF on how
             | to handle higher current ("battery charge") situations,
             | spurred by apple [2], with all sorts of weird corner cases
             | [3].
             | 
             | Now, USB-C changes that and specifically calls out that a
             | "compliant" downstream device has to negotiate USB PD or
             | declare yourself a USB-2.0 type-C device. [4] It's not
             | uncommon for newer devices that conform strictly to the
             | USB4 specification to not even power a port that hasn't
             | negotiated USB-PD or Legacy PD -- if you encounter devices
             | that get weird when powered via a usb-c to usb-c cable but
             | work fine on a usb a-to-c cable, you've seen someone skimp
             | out on $0.00001 in resistors.
             | 
             | [1] https://www.microchip.com/en-us/development-tool/EVB-USB2514...
             | [2] https://www.usb.org/document-library/battery-charging-v12-sp...
             | [3] https://www.graniteriverlabs.com/en-us/technical-blog/usb-ba...
             | [4] https://community.infineon.com/t5/Knowledge-Base-Articles/Te....
        
         | 0xTJ wrote:
         | The HDMI source, not the HDMI sink, provides the power at 5 V.
         | As far as I know, every Chromecast required an external power
         | connection.
        
         | globnomulous wrote:
         | My office stereo has physical connections between the following
         | devices (simplifying a bit)
         | 
         | - Speakers connect via speaker wire to monoprice 7x200 amp
         | 
         | - Monoprice amp connects via RCA to denon x3800h
         | 
         | - X3800h receives HDMI from desktop computer and sends HDMI to
         | a monitor.
         | 
         | - Same computer connects via Displayport to the same monitor
         | 
         | I used to hear an infuriating buzz when my 2080TI started to
         | work hard. It changed depending on the screen output, GPU
         | strain, and mouse activity but was constant. It acted like a
         | combination ground loop cum coil whine.
         | 
         | The first fix I discovered was to ground my monoprice amp to
         | the 2080 TI PCB by wrapping one end of the exposed-copper (12
         | awg, I think) grounding wire through and around one of the
         | holes in the board and attaching the other end to the Monoprice
         | amp's grounding pin.
         | 
         | This fixed the issue completely.
         | 
         | Then I realized I could fix the issue more elegantly and
         | eliminate the need for grounding: I removed the grounding wire
         | and replaced my normal HDMI and Displayport cables with fiber
         | optic HDMI and Displayport cables. The buzz has never recurred.
         | 
         | I've never delved further into the problem, but my conclusion
         | is the same as yours: there's a design fault somewhere on the
         | board, which is causing electricity to flow in ways it
         | shouldn't. I'm using an MSI z690 ddr4 edge wifi board. Same
         | brand, same generation, as the board where this guy is having
         | his SSD power issue.
         | 
         | I still hear a weird, loud buzz through the stereo (including a
         | separate amp and separate pair of speakers) when my partner
         | runs her hair dryer upstairs, even though my stereo runs on its
         | own separate circuit, so regardless of the design issues in the
         | board, there's definitely also an issue in my electrical
         | system.
        
           | transpute wrote:
           | Power conditioner can improve AC isolation
           | 
           | https://www.amazon.com/Furman-AC-215A-Conditioner-Auto-Reset...
           | 
           | https://surgestop.com/surge-products/m-474.html
        
             | globnomulous wrote:
             | Thanks, this is great advice. I'm using two SurgeX SX
             | 2120-SEQ power conditioner+sequencers -- one for the
             | desktop devices and one for the stereo.
             | 
             | I'm baffled that, even with the conditioners and even
             | though I'm on a separate circuit in my office, the hairdryer
             | is still able to do _something_ to affect the electricity
             | in my office.
        
               | alduin32 wrote:
               | > the hairdryer is still able to do something to affect
               | the electricity in my office.
               | 
               | This may indicate that your neutral line is undersized
               | and/or damaged.
        
               | globnomulous wrote:
               | How could I test this?
        
               | alduin32 wrote:
               | A first thing to test would be that your voltages are
               | nominal, but the exact details depend on how many phases
               | are coming from the transformer, how they are wired, and
               | whether you are on a TT, TN-C-S or other kind of
               | grounding system, which depends mostly on where you live.
               | Also, you need to take your voltages both at low
               | impedance (simulates a load) and at high impedance
               | (negligible load, "classical" meters are generally high
               | impedance).
               | 
               | Generally, you want to measure the voltage difference
               | between live and neutral depending on the load. However,
               | depending on the tools you have access to, taking this
               | reading properly can be a bit tricky both because simple
               | high-impedance multimeters can easily be tricked by
               | ghost voltages caused by bad connections and induction
               | from other cables, and also because understanding what to
               | measure requires knowing how the electrical system is
               | wired.
               | 
               | If you know you are in a TT system with 240V between
               | Live/Neutral, I can tell my procedure for inspecting
               | neutrals. In a two-pole TN-C-S system with 120V between
               | L1/Neutral and 240V between L1/L2, I suppose it would be
               | similar, except that we'd have to do more tests (both L1
               | and L2 to neutral, and I imagine also L1 to L2).
               | 
               | EDIT: a first simple check to do is to check, using any
               | multimeter, if there is voltage drop in your office when
               | the hairdryer is in use.
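
A rough sketch of what to do with the two readings from that simple check: the percentage drop and, if the load current is known, the apparent resistance of the wiring the load shares with the outlet being measured. The function names and the sample numbers are illustrative assumptions, not measurements from this thread, and a real diagnosis still depends on the grounding system as described above.

```python
def voltage_drop_percent(v_idle, v_loaded):
    """Percentage drop between the unloaded and loaded voltage readings."""
    return 100.0 * (v_idle - v_loaded) / v_idle


def apparent_resistance(v_idle, v_loaded, load_amps):
    """Ohm's law estimate (R = dV / I) of the shared wiring resistance."""
    if load_amps <= 0:
        raise ValueError("load current must be positive")
    return (v_idle - v_loaded) / load_amps


if __name__ == "__main__":
    # Hypothetical readings: 240 V idle, 228 V with an 8 A load running.
    idle, loaded, amps = 240.0, 228.0, 8.0
    print(f"drop: {voltage_drop_percent(idle, loaded):.1f} %")
    print(f"apparent resistance: {apparent_resistance(idle, loaded, amps):.2f} ohm")
```

A drop that shows up on a supposedly separate circuit points at whatever the two circuits still share, such as the neutral or the service connection.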
        
           | tinfever wrote:
           | Interestingly, the PCIe 8-pin power cable into a GPU doesn't
           | carry all of the return current. If you put a current clamp
           | meter around the +12V wires and then the ground wires, you'll
           | measure more amps on the +12V wires than the ground wires.
           | This means some of the return current goes through the PCIe
           | slot into motherboard and makes its way back to the PSU. This
           | lets the GPU create audio noise because GPUs draw high
           | current pulses at the frame rate of your monitor, which means
           | the return current through the motherboard has high current
           | pulses, which can create ground bounce on the motherboard
           | where the ground voltage level moves up and down and that can
           | affect other devices in the system.
           | 
           | I don't totally know how that noise would travel over the
           | ground shield of the HDMI cable into the analog section of
           | the Denon receiver though. Maybe some of that GPU return
           | current is going through the HDMI cable, through the Denon
           | receiver to mains earth, and then through your building
           | wiring back to the ATX PSU? Grounding is freaking weird.
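
The clamp-meter observation above reduces to a current balance: whatever goes in on the +12 V wires of the cable and does not come back on the cable's ground wires must return by another path. A minimal sketch of that bookkeeping, with made-up per-wire readings; it ignores any current the card also draws through the slot itself.

```python
def off_cable_return_amps(supply_wire_amps, ground_wire_amps):
    """Current that entered via the cable but returns some other way."""
    return sum(supply_wire_amps) - sum(ground_wire_amps)


def off_cable_fraction(supply_wire_amps, ground_wire_amps):
    """Share of the cable's supply current that does not return on the cable."""
    total = sum(supply_wire_amps)
    if total <= 0:
        raise ValueError("no supply current measured")
    return off_cable_return_amps(supply_wire_amps, ground_wire_amps) / total


if __name__ == "__main__":
    # Hypothetical clamp readings in amps, one entry per wire.
    supply = [4.0, 4.0, 4.0]
    ground = [3.0, 3.0, 3.0]
    print(off_cable_return_amps(supply, ground), "A returns via slot or shields")
    print(f"{off_cable_fraction(supply, ground):.0%} of the cable current")
```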
        
             | globnomulous wrote:
             | Oh, wow, yeah, that's really interesting. I don't
             | understand electricity or know nearly enough about
             | electrical engineering to be sure I understand the effect
             | or flow you're describing, but if I (dimly) grasp what
             | you're saying, it would explain exactly the behavior I
             | observed.
             | 
             | Grounding really is incredibly weird (and, again, I say
             | this as someone who is shamefully ignorant of electrical
             | principles). It's no surprise that some 'audiophiles'
             | become so superstitious about electricity. Its behavior in
             | a stereo can be mysterious. Just looking at an amp funny
             | seems like enough to cause a ground loop.
        
       | jauntywundrkind wrote:
       | I can't get my Crucial P3+ to wake from sleep.
       | 
       | I'd like to dig in more but I haven't had this issue with any
       | other SSD in this system. Pretty close to saying I'm done with
       | Crucial.
        
         | NewJazz wrote:
         | I've had a similar experience with a crucial nvme drive, but a
         | kernel update seems to have introduced a quirk-based fix. Not
         | sure how much of a kludge that fix is, though.
        
           | wtallis wrote:
           | The quirks tables in the Linux NVMe drivers are impressive
           | and depressing:
           | 
           | https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
           | 
           | https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
           | 
           | https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
           | 
           | And they're not even close to being comprehensive.
        
             | fulafel wrote:
             | Interesting that there are also some anti-quirk special
             | cases in the vendor combo function (second link above), so
             | a certain platform is excepted from the quirk workaround:
             | 
             |     /*
             |      * Exclude some Kingston NV1 and A2000 devices from
             |      * NVME_QUIRK_SIMPLE_SUSPEND. Do a full suspend to save a
             |      * lot fo energy with s2idle sleep on some TUXEDO platforms.
             |      */
             |     if (dmi_match(DMI_BOARD_NAME, "NS5X_NS7XAU") ||
             |         dmi_match(DMI_BOARD_NAME, "NS5x_7xAU") ||
             |         dmi_match(DMI_BOARD_NAME, "NS5x_7xPU") ||
             |         dmi_match(DMI_BOARD_NAME, "PH4PRX1_PH6PRX1"))
             |             return NVME_QUIRK_FORCE_NO_SIMPLE_SUSPEND;
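
To make the "anti-quirk" idea concrete, here is a simplified Python model of a quirk flag being overridden for specific boards. The board names are the ones in the snippet quoted above; the bit values, the function names, and the way the override is applied are assumptions for illustration and are not the kernel's actual code path.

```python
# Illustrative bit values only; the kernel defines its own.
QUIRK_SIMPLE_SUSPEND = 1 << 0
QUIRK_FORCE_NO_SIMPLE_SUSPEND = 1 << 1

# Board names taken from the quoted snippet.
EXCLUDED_BOARDS = frozenset(
    {"NS5X_NS7XAU", "NS5x_7xAU", "NS5x_7xPU", "PH4PRX1_PH6PRX1"}
)


def board_quirks(board_name):
    """Return the override flag if the board is on the exclusion list."""
    if board_name in EXCLUDED_BOARDS:  # exact, case-sensitive comparison
        return QUIRK_FORCE_NO_SIMPLE_SUSPEND
    return 0


def effective_quirks(device_quirks, board_name):
    """Combine per-device quirks with the board override (assumed model)."""
    quirks = device_quirks | board_quirks(board_name)
    if quirks & QUIRK_FORCE_NO_SIMPLE_SUSPEND:
        # The override wins: clear the workaround the device would get.
        quirks &= ~QUIRK_SIMPLE_SUSPEND
    return quirks


if __name__ == "__main__":
    for board in ("NS5x_7xAU", "SOME_OTHER_BOARD"):
        q = effective_quirks(QUIRK_SIMPLE_SUSPEND, board)
        print(board, "simple suspend:", bool(q & QUIRK_SIMPLE_SUSPEND))
```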
        
               | wtallis wrote:
               | I think some of those issues probably stem from the fact
               | that there's not really any alignment between the NVMe
               | spec and the PCIe spec with respect to power management
               | capabilities. I've encountered drives that have implicit
               | dependencies where certain NVMe power management features
               | only work as intended when certain PCIe power management
               | features are available, but there's no way for the drive
               | to express those requirements to the host system, and no
               | standard compliance test suite that will reveal the
               | broken behavior that can occur in the wild.
               | 
               | Sometimes figuring out who to blame for misbehaving
               | hardware requires custom kernel patches, a hardware
               | protocol analyzer at the M.2 slot, and reverse-
               | engineering the motherboard firmware. Most of the entries
               | in the quirks tables are based on a lot of guess-work and
               | inferences because the kernel developers don't have the
               | resources to fully investigate and reproduce these kinds
               | of issues (and the hardware vendors simply don't care
               | about thoroughly ironing out these bugs). It really sucks
               | when you have to look at power flow out of the laptop
               | battery and try to figure out from that whether your SSD
               | is pulling more power than it should.
        
               | fulafel wrote:
               | Wow. I guess this also explains some of the s2idle
               | troubles, with S3 sleep there are the vendor-tested
               | motherboard+peripheral combos that are shown to work with
               | the power states attempted by suspend and any hw/fw bugs
               | get troubleshot before they make it out of the vendor's
               | lab.
        
               | jandrese wrote:
               | Oh yeah, and in some cases if the system attempts to go
               | into S2 sleep it simply bricks the SSD forever. I lost a
               | whole lab worth of drives once before I figured it out.
               | The vendor was the opposite of helpful, refusing to
               | acknowledge the problem and then wiping their hands of it
               | and walking away. The only solution I've found requires a
               | hardware modification of the drive, downloading a rip of
               | the vendor's internal repos from a sketchy russian
               | website, building a new firmware from scratch, and then
               | flashing it with some custom hardware.
        
             | hulitu wrote:
             | That would explain why, sometimes, my linux will not find
             | the NVME SSD when booting. (MSI mobo with Kingston SSD).
        
               | chupasaurus wrote:
               | Model or at least year of that SSD? Early on Kingston
               | used faulty controllers that randomly fail to initiate
               | and degrade with power cycles.
        
               | hulitu wrote:
               | Since last year.
        
               | doubled112 wrote:
               | I have a pair of ASUS Vivobook laptops with Kingston
               | NVMEs.
               | 
               | While running the factory install of Windows, those NVMEs
               | would cause a BSOD every third or so boot. Clean install
               | didn't help either, nor any driver or firmware update.
               | 
               | No Linux install has shown any signs of problems.
        
         | wtallis wrote:
         | Is this on a Linux system? NVMe power management has always
         | been hit or miss for consumer SSDs under Linux because the SSD
         | vendors don't write their firmware against the NVMe spec, they
         | write it to work with the Microsoft Windows NVMe driver and any
         | feature Windows doesn't use is liable to be broken. This
         | applies to basically every SSD brand, by the way.
        
           | jauntywundrkind wrote:
           | Yes, it's an NVMe.
           | 
           | Western Digital & OCZ nvme drives have both worked fine in
           | this system, so I'm feeling a bit salty about this. Would
           | like to try some Samsung drives at some point.
           | 
           | (Running Linux 6.11.7 atm.)
        
         | Astronaut3315 wrote:
         | I returned a Crucial P3+ after I discovered a massive
         | performance degradation with Bitlocker. It was slower than
         | spinning rust. Seems these drives have some unresolved firmware
         | issues.
        
       | zamadatix wrote:
       | On the topic of odd failure modes involving Crucial SSDs and MSI
       | motherboards (though one that seems to actually be the drive's
       | fault) I have a t705 which at some point started only coming up
       | as x2 lanes instead of x4 no matter which board I put it into
       | (with no visible damage or indication as to why, though I did try
       | to wipe down the contact side with some rubbing alcohol anyways).
       | 
       | The particularly interesting part is I have a new x870
       | motherboard which supports m.2 slot 2 as being 0x, 2x, or 4x CPU
       | direct lanes depending if you want 4x, 2x, or 0x to go to the USB
       | 4 ports respectively. At first it sounds like a good combo - put
       | the drive which wants to run at x2 only in the extra slot where
       | x2 only mode is a reasonable tradeoff and still get great
       | bandwidth because those lanes are pcie 5 and not through the
       | chipset. For whatever reason though that drive only ever comes up
       | in an x4 slot (at x2 speed) but not any x2 slots I've tried. I
       | don't know enough about PCIe to assume why that is for sure but
       | it seemed odd to me it was any way but "something is wrong with
       | the 3rd or 4th lane and setting the slot to x2 lets the first 2
       | work at x2 the same as when the slot is set to x4 and it only
       | comes up as x2".
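
For a sense of why an x2 link on PCIe 5.0 lanes is still a reasonable tradeoff, here is a back-of-the-envelope sketch of theoretical link bandwidth. It uses the standard per-lane signaling rates (8, 16 and 32 GT/s for generations 3, 4 and 5) and 128b/130b line encoding, and it ignores packet and protocol overhead, so real drive throughput will be lower. The function name is made up for illustration.

```python
# Per-lane signaling rate in GT/s for generations using 128b/130b encoding.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}


def link_gbytes_per_s(generation, lanes):
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    raw_gbits = GT_PER_S[generation] * lanes
    payload_gbits = raw_gbits * 128 / 130  # 128b/130b line encoding
    return payload_gbits / 8


if __name__ == "__main__":
    for gen, lanes in ((5, 4), (5, 2), (4, 4), (4, 2), (3, 2)):
        print(f"Gen{gen} x{lanes}: {link_gbytes_per_s(gen, lanes):.2f} GB/s")
```

On these numbers a Gen5 x2 link has the same raw bandwidth as a Gen4 x4 link.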
        
         | magicalhippo wrote:
         | PCIe devices are required to boot up using x1 lane only, and
         | then negotiate further lanes with upstream.
         | 
         | AFAIK it shouldn't matter if they're direct to CPU or not, at
         | least not logically.
         | 
         | I note the drive is Gen5 capable, does it negotiate x2 5.0
         | lanes or something else?
        
           | zamadatix wrote:
           | Negotiates to 2x 5.0 so long as the board it's plugged into
           | supports it. 2x 4.0 or 3.0 otherwise. Hadn't tested even
           | lower.
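On Linux the negotiated width can also be read without lspci. A minimal sketch, assuming the kernel exposes the usual `current_link_width`, `max_link_width`, `current_link_speed` and `max_link_speed` sysfs attributes for the device; the PCI address and the `link_status` helper are hypothetical, not something from the thread.

```python
from pathlib import Path


def link_status(bdf: str, sysfs_root: str = "/sys") -> dict:
    """Read negotiated and maximum PCIe link width/speed for one device.

    bdf is the PCI address of the device, e.g. "0000:02:00.0" (hypothetical).
    """
    dev = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf
    status = {}
    for attr in ("current_link_width", "max_link_width",
                 "current_link_speed", "max_link_speed"):
        status[attr] = (dev / attr).read_text().strip()
    # A drive that trains at x2 in an x4-capable slot shows up here.
    status["width_downgraded"] = (
        int(status["current_link_width"]) < int(status["max_link_width"])
    )
    return status
```

Calling `link_status("0000:02:00.0")` on a real system would return the four attribute strings plus a boolean flag; the address has to be replaced with the one lspci reports for the SSD.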
        
         | tfwnopmt wrote:
         | I came across this in a manual/datasheet:
         | 
         | > _16. Link Width Negotiation in the Presence of Bad Lanes
         | 
         | >In an effort to maximize the link width when one or more lanes
         | of a multi-lane link are not functioning correctly (i.e.,
         | reliable communication of training sets across the lane is not
         | possible), PES64H16G2 downstream switch ports automatically
         | attempt a lane reversed configuration when doing so has the
         | potential to enhance the achievable link width. For example, if
         | lane 1 of a x4 link is not operating correctly, the device's
         | downstream switch port attached to the link attempts a lane
         | reversed configuration to form a x2 link using lanes 2 and 3
         | (Figure 7.4(d)). If the link partner accepts the lane reversed
         | configuration, the optimal x2 link will be formed using lanes 2
         | and 3. If the link partner does not accept the lane reversed
         | configuration, but instead requests a lane configuration
         | supported by the PES64H16G2 (e.g., x1 link using lane 0), the
         | device accepts the configuration and forms the reduced width
         | link. Otherwise, if the lane numbering agreement fails, the
         | device automatically re-trains the link from the Detect state.
         | During this re-training, the PES64H16G2 port does not re-
         | attempt a lane reversed configuration, but rather tries to form
         | the link without reversing the lanes. As a result, a x1 link is
         | formed using lane 0 (Figure 7.4 (e)). _
         | 
         | My guess is it's likely a bad BGA solder ball on Lane1, or
         | possibly ESD damage if you took the SSD out and molested it or
         | rubbed it on a cat right before it broke. Does it indicate it's
         | using reversed lanes?
        
           | zamadatix wrote:
           | Nice digging, that lines up perfectly with the observed
           | behavior! I'll have to poke around and see if anything
           | indicates that's the operational mode to be sure.
           | 
           | The failure mode was that one day I just noticed it was
           | copying sequential data from another drive slower than it
           | normally did. Don't recall it ever having been touched after
           | install (it is the heatsinkless variant of the T705 4TB
           | mounted on the motherboard m.2 heatsink for that slot). Temps
           | always reported quite reasonable, even when under stress
           | bench load (which was rare, the drive was just a secondary
           | drive for loading games). Since then it's been popped between
           | about 10 boards in confusion though haha. No cat yet!
        
         | lizknope wrote:
         | I just got a Crucial T700 last month which is a PCIE Gen5 x4
         | NVMe M.2 drive.
         | 
         | I put it in an ASUS PRIME Z890M-PLUS motherboard with an Intel
         | Core Ultra 7 265K
         | 
         | Started to install Fedora Linux version 41. The drive would
         | just completely disappear from the OS and the kernel would
         | report I/O errors on a missing device. Sometimes this happened
         | during the initial install. Sometimes 5 minutes after the
         | install when starting a terminal. I couldn't even type "ls"
         | because the "ls" command is on the drive that went away.
         | 
         | Saw reports of PCIE Gen5 incompatibilities so I moved it to a
         | Gen4 slot and then it worked.
         | 
         | But the machine had so many other random crashes and errors
         | reported in system logs saying "This is a hardware error not
         | software" and stuff like that. Returned it all.
         | 
         | Just got an AMD Ryzen 9 9950X and Gigabyte X870E AORUS PRO
         | 
         | The Gen5 drive seems to be working at Gen5 speeds.
         | 
         | lspci -vv shows
         | 
         | 02:00.0 Non-Volatile memory controller: Micron/Crucial
         | Technology T700 NVMe PCIe SSD (prog-if 02 [NVM Express])
         | LnkSta: Speed 32GT/s, Width x4
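The `LnkSta` line above is easy to check mechanically. A minimal sketch, assuming lspci output in the form quoted in this comment (newer pciutils versions may append notes such as "(ok)" or "(downgraded)", which the pattern tolerates); `parse_lnksta` and the speed table are illustrative names, not part of any tool.

```python
import re

# Raw per-lane transfer rate (GT/s) mapped to PCIe generation.
GEN_BY_SPEED = {"2.5": 1, "5": 2, "8": 3, "16": 4, "32": 5}

LNKSTA_RE = re.compile(r"LnkSta:\s*Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_lnksta(text: str):
    """Return (generation, lane_count) from lspci -vv text, or None."""
    m = LNKSTA_RE.search(text)
    if m is None:
        return None
    speed, width = m.group(1), int(m.group(2))
    # Unknown speeds map to None rather than guessing a generation.
    return GEN_BY_SPEED.get(speed), width


print(parse_lnksta("LnkSta: Speed 32GT/s, Width x4"))  # (5, 4)
```

A result of `(5, 4)` corresponds to the full Gen5 x4 link reported here; the x2 drive discussed earlier in the thread would parse as `(5, 2)`.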
        
       | geor9e wrote:
       | Why's a random tech support forum post from yesterday with 2
       | people replying getting reposted to HN
        
         | aprilnya wrote:
         | I personally found it interesting.
        
         | frantathefranta wrote:
         | Slow week but people probably enjoy the methodical
         | troubleshooting.
        
         | ejiblabahaba wrote:
         | For what it's worth, this post just helped me explain several
         | years of failure to wake from sleep state, across several
         | different MSI-based machines, when I've connected them to an
         | HDMI port in my TV. I think this debug is interesting in its
         | own right, and unlike 99% of the content on this website, it
         | was directly and immediately useful to me. I doubt I'm the only
         | one, too.
        
         | transpute wrote:
         | This post described a rare interoperability failure with
         | unexpected root cause, of possible interest to:
         | - Motherboard designers
         | - People upgrading PCs/laptops
         | - SSD firmware developers
         | - BIOS developers attempting PCIe device boot
         | - OS/hypervisor developers attempting PCIe device reset
         | 
         | If you don't like this HN story, you could contribute your
         | first story to HN.
        
       | sebazzz wrote:
       | I have something similar with my webcam, which is connected to my
       | Samsung monitor USB hub, which is connected to a USB-C dongle,
       | which is connected to my work laptop.
       | 
       | If my laptop crashes during a Microsoft Teams call, possibly due
       | to the webcam, it will not show up in Windows again without it
       | physically being disconnected from the USB hub in my Samsung
       | monitor. I can disconnect the USB-C dongle or the monitor from
       | USB, change ports, power off the laptop, it doesn't matter
       | because that doesn't work. Only physically disconnecting and
       | reconnecting it makes it show up in device manager again.
        
       | qingcharles wrote:
       | I hate faults like that.
       | 
       | Used to work in PC repair. Man brings in PC, mouse right click
       | doesn't work. Everything else operates perfectly.
       | 
       | Replaced in this order: mouse, IO card, hard drive with fresh OS,
       | RAM, CPU, graphics card, motherboard. Still no right-click.
       | 
       | Replaced the PSU last. Right-click works. FML.
        
         | Frenchgeek wrote:
         | You didn't have to replace the house's wiring at least
         | (Happened to an aunt of mine: Gave her a computer, it worked
         | perfectly outside of her home. The electrician was a tad
         | horrified. She still scoffed when I suggested the computer
         | wasn't the problem first.)
        
           | Moru wrote:
           | I plugged my old Atari into an outlet in the old basement in
           | a different building. The HDD-cable started burning.
           | 
           | Electric company plugged in some device to measure power over
           | time. Turns out the power was slightly below normal but
           | within tolerances. The OEM power supply that was powering my
           | Atari wasn't up to standards. If I remember right, badly
           | designed PSUs can feed too high current if the voltage is
           | too low. Or something like that, was a very long time ago...
        
             | ajb wrote:
             | Many switch mode power supplies will increase the current
             | draw if the voltage drops, that's why many of them will
             | work on both 120 and 240V, while old school power supplies
             | need a manual switch. I had a brownout once and thought my
             | washing machine was broken because that was the only thing
             | that stopped working (Until evening when I switched on the
             | lights. That was back in the days of incandescents, oddly
             | though LED lights still dim with lower power, I don't know
             | how they do voltage conversion).
             | 
             | We have so many cheap power supplies in our houses that it
             | would not surprise me if at least some become unsafe if the
             | source voltage drops too low. Being unsafe with only a
             | slight drop is weird though.
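The point about switch-mode supplies drawing more current as the voltage drops follows from them behaving roughly like constant-power loads. A minimal sketch, assuming an idealised supply with a fixed output power and a fixed efficiency; the 90 W load and 0.9 efficiency are made-up illustration values, and real supplies shut down below some minimum input voltage.

```python
def input_current(load_watts: float, mains_volts: float,
                  efficiency: float = 0.9) -> float:
    """Input current in amps for an idealised constant-power supply."""
    return load_watts / (efficiency * mains_volts)


for volts in (240, 120, 100):
    print(f"{volts} V in -> {input_current(90, volts):.2f} A")
```

Halving the input voltage doubles the input current in this model, which is the mechanism by which a brownout can stress an undersized or badly designed supply.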
        
         | ksec wrote:
         | >Replaced the PSU last. Right-click works. FML.
         | 
         | In my experience, always replace the DRAM first, then the PSU,
         | and then swap the motherboard.
         | 
         | I don't think people realise how many faults there are with
         | DRAM, PSU and MB. DRAM quality has gotten a lot better in the
         | past 10 years so that is less of an issue. The PSU, however, is
         | where the cost cutting is, and more often than not it causes
         | problems.
        
         | donalhunt wrote:
         | Reminds me of an old hwops story where one machine just
         | constantly failed despite replacing every part on the tray
         | multiple times. The conclusion was that the tray was bad.
         | 
         | Google's definition of a server was (and still is afaik) based
         | on the tray (chassis) so there was no way to replace it. IIRC
         | it was "retired" with vengeance leaving a gap in the cabinet --
         | a warning to other trays to behave.
        
       | userbinator wrote:
       | This is a good cautionary story of why random parts-swapping can
       | be a waste of time and money. Getting out the DMM and measuring
       | voltages is something fewer and fewer people know how to do when
       | troubleshooting electronics, but it certainly saved the OP here;
       | I'd go a little further and figure out why the monitor seems to
       | be leaking power into its HDMI input when switched off ---
       | possibly an ESD-damaged MOSFET or similar?
       | 
       |  _The issue does not occur when the monitor is connected via
       | DisplayPort._
       | 
       | https://en.wikipedia.org/wiki/DisplayPort#DP_PWR_(pin_20)
       | 
       |  _Standard DisplayPort cable connections do not use the DP_PWR
       | pin._
       | 
       | There's also an interesting paragraph there, about some
       | nonstandard cables connecting that pin through.
        
         | Arcanum-XIII wrote:
         | Not all DMMs have probes small enough to connect to the lane, if
         | it's even possible. What's more, you need to know where to put
         | them, which can be daunting without the proper knowledge.
         | Switching hardware is easier, faster and often the best
         | solution in those cases.
         | 
         | Finding a hardware fault is hard. Tracing it is even harder.
        
           | userbinator wrote:
           | I think there's something wrong with your DMM probes if you
           | can't measure the ATX power connector with them.
        
         | hamandcheese wrote:
         | On the other hand, I recently fried a motherboard while trying
         | to probe it with a multimeter. My fat fingers shorted out two
         | adjacent pins, causing a loud spark and magic smoke.
        
       | bunnie wrote:
       | Reading the thread it looks like the issue is leakage power on
       | the internal 3.3v line. When the system is off 1.9v is still
       | present. This is not uncommon, although 1.9v is a bit high. A lot
       | of laptops have explicit active pull downs on power supplies to
       | clamp them to zero when power is off to ensure peripherals are
       | not accidentally powered on by stray leakage (because laptops are
       | extremely low power by design and there is not enough stray
       | leakage to bring the power lines down in a sleep state). My guess
       | is main boards might not have this feature because normally there
       | is enough off-state loading that it takes care of itself. However,
       | maybe in this case the loading is not enough.
       | 
       | A dirty fix could be to just put a static load on the 3.3v line
       | to ground. I'd start with a 1/4w resistor around 100 ohms and
       | just stick it from 3.3v to ground to see whether it soaks up
       | the stray current. If it works just leave it; it's about 0.1
       | watts of static power and no big deal for a non-portable setup.
       | 
       | The larger picture is that the controller on the nvme might not
       | hit its power on reset condition because it may be rated to run
       | at 1.8v (just a guess), so 3.3v is not going low enough for the
       | controller to perceive the system has been power cycled. Usually
       | a supplemental power monitor is needed in those cases to ensure a
       | reset is generated in case of leakage problems like this.
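The "about 0.1 watts" figure for the suggested bleed resistor can be checked with Ohm's law. A minimal sketch, assuming the rail sits at its nominal 3.3 V while the machine is on; `resistor_load` is an illustrative helper.

```python
def resistor_load(volts: float, ohms: float):
    """Return (current in amps, power in watts) for a resistor across a rail."""
    amps = volts / ohms
    return amps, volts * amps


amps, watts = resistor_load(3.3, 100.0)
print(f"{amps * 1000:.0f} mA, {watts:.3f} W")  # 33 mA, 0.109 W
```

At roughly 0.11 W the 100 ohm part stays inside a 1/4 W rating, though with less than the 2x margin often preferred for a resistor that is left in place permanently.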
        
         | starslab wrote:
         | Hi! I'm the OP from the Level1Techs thread.
         | 
         | That HDMI power has some grunt behind it. During power-off
         | state with that 1.90v phantom voltage, I put a 48ohm resistor
         | between 3V3 and ground, the phantom voltage only dropped to
         | 1.80v, and the SSD still didn't work when I powered the machine
         | back on.
        
           | oneplane wrote:
           | Depending on the PMIC and the SSD DC conditioning, even 1.2v
           | might be enough for it to brownout/latchup without self-
           | resetting. (or it might power up the PHY partially or in a
           | bad state and never link up)
           | 
           | Try more resistors in series? (or just a bigger one if you
           | have any -- scratch that we needed smaller ;-) ).
        
             | starslab wrote:
             | 12 ohms brings the rail down to 1.47 volts, still no SSD. 6
             | ohms is enough to finally break/trip whatever circuit is
             | allowing this situation, bringing the rail down to 0v in
             | power-off. Of course, that's almost 2 watts of constant
             | draw during the power-on state, so not a long-term
             | solution.
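The measurements in this sub-thread are enough for a rough estimate of how stiff the leakage source is. A minimal sketch, assuming the leak behaves like a simple voltage source with a series resistance (it probably does not, since the two estimates differ and the rail collapses entirely at 6 ohms) and treating the 1.90 V reading as the unloaded voltage; the function names are illustrative.

```python
V_OPEN = 1.90     # volts on the 3V3 rail when powered off, no added resistor
V_RAIL_ON = 3.3   # nominal rail voltage when the machine is powered on


def source_resistance(v_loaded: float, r_load: float,
                      v_open: float = V_OPEN) -> float:
    """Estimate the series resistance of the leak path from one loaded reading."""
    i_load = v_loaded / r_load           # current through the added resistor
    return (v_open - v_loaded) / i_load  # voltage sag divided by that current


def power_when_on(r_load: float, v_rail: float = V_RAIL_ON) -> float:
    """Power the added resistor dissipates while the machine is running."""
    return v_rail ** 2 / r_load


for r_load, v_loaded in [(48.0, 1.80), (12.0, 1.47)]:
    r_src = source_resistance(v_loaded, r_load)
    print(f"{r_load:.0f} ohm load: {v_loaded / r_load * 1000:.1f} mA, "
          f"source ~{r_src:.1f} ohm")

print(f"6 ohm load while on: {power_when_on(6.0):.2f} W")
```

Both readings point to a source impedance of only a few ohms, consistent with the "grunt" described above, and the 6 ohm case works out to about 1.8 W, matching the "almost 2 watts" estimate.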
        
               | oneplane wrote:
               | Oof, that is a giant leak somewhere. It's really sad we
               | have to go to some shady websites to find schematics for
               | mainboards, otherwise we could just get to the cause of
               | this pretty quickly.
        
               | numpad0 wrote:
               | 6 Ohms! Might as well just jumper it (don't)
               | 
               | Does it sound like reverse current through SBD? They have
               | higher reverse current and leaky I-V curve. 3.3V of drop
               | must mean something inline.
        
             | starslab wrote:
             | > scratch that we needed smaller ;-)
             | 
             | Well... Needed smaller in terms of resistance, but needed
             | bigger in terms of power rating, in the interests of not
             | catching fire.
        
       | okanat wrote:
       | I bought the same model SSD for my Thinkpad P1 last month and saw
       | the exact issue. I had to return it because it was breaking the
       | NVMe detection completely. So it wasn't a broken unit but a
       | design issue after all?
        
       | BearOso wrote:
       | Since we're talking SSDs, I wonder if we could get some attention
       | to the Phison E18 degradation issue [1]. Only one manufacturer,
       | Kingston, has put out firmware containing Phison's fix, while the
       | others just ignore it.
       | 
       | A bunch of these drives with this controller were on sale during
       | black Friday, so a lot more people are going to have problems in
       | a month or so.
       | 
       | 1.
       | https://www.reddit.com/r/pcmasterrace/comments/1f1piwf/psa_p...
        
         | userbinator wrote:
         | That sounds like NAND degradation (retention failures) which
         | can only be partially worked around in firmware (and causing
         | more write cycles on already-marginal QLC). Unfortunately the
         | real solution is "use better NAND", which is unlikely to happen
         | unless enough people demand it.
        
           | ciupicri wrote:
           | Kingston KC3000 supposedly uses Micron 176L TLC memory [1].
           | 
           | The Seagate Firecuda 530 datasheet clearly says "Built with a
           | Seagate-validated E18 controller and the latest 3D TLC NAND".
           | A review is more precise: "Phison PS5018-E18" & "Micron B47R
           | 176-layer 3D TLC NAND" [2].
           | 
           | [1]: https://www.tomshardware.com/reviews/kingston-
           | kc3000-m2-ssd-...
           | 
           | [2]: https://www.kitguru.net/components/ssd-drives/simon-
           | crisp/se...
        
             | userbinator wrote:
             | B47R is indeed TLC, rated for only 1000 cycles (and 35k in
             | SLC mode, at 1/3 the capacity.) There's also the question
             | of whether this is "true" Micron NAND, or SpecTek which is
             | basically Micron's rejects (and rated for even fewer
             | cycles; only 300 in the case of their B16A.)
        
         | ciupicri wrote:
         | Kingston doesn't seem to offer any support for Linux, so their
         | new firmware is virtually non-existent to me. Why can't I just
         | download the firmware and use standard nvme-cli tools to update
         | the SSD, beats me. If Seagate (which by the way uses Phison E18
         | too) can do it, so can Kingston, Samsung, Crucial, Western
         | Digital and many others.
         | 
         | Even better would be use Linux Vendor Firmware Service
         | (https://fwupd.org/).
        
       | amelius wrote:
       | I have a similar problem with a Jetson board. If I turn off the
       | power long enough (one night) and then turn it on, the only PCI
       | card is not recognized and I have to power-cycle it to get it
       | running.
        
         | structural wrote:
         | Mind sharing what board/Jetson module you've seen this on? I've
         | seen this exact symptom very intermittently on a custom board
         | and we've wondered for a long time if it was an issue with a
         | specific type of module (or manufacturing lot of modules).
        
           | amelius wrote:
           | This one: https://www.avermedia.com/professional/product-
           | detail/D315%2...
           | 
           | My startup logic now power-cycles it until the PCI board is
           | recognized; it works, but it's not a great solution.
        
             | structural wrote:
             | Interesting, we're using a completely different module
             | (Xavier NX), with the same disgustingly hacky fix of
             | forcing a reset until it works.
        
               | amelius wrote:
               | I also run these commands:
               | 
               |     echo 1 > /sys/bus/pci/rescan
               |     sleep 1
               | 
               | Sometimes it brings the PCI card back, so I just run this
               | as part of my boot sequence.
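The rescan workaround above can be wrapped in a bounded retry loop. A minimal sketch, assuming root privileges and that the card's PCI address is known in advance; the address, attempt count and `rescan_until_present` helper are hypothetical, and a rescan only helps when the device responds on the bus at all.

```python
import time
from pathlib import Path


def rescan_until_present(bdf: str, sysfs_root: str = "/sys",
                         attempts: int = 5, delay: float = 1.0,
                         sleep=time.sleep) -> bool:
    """Trigger PCI bus rescans until the device node appears or attempts run out."""
    pci = Path(sysfs_root) / "bus" / "pci"
    device = pci / "devices" / bdf
    for _ in range(attempts):
        if device.exists():
            return True
        # Same effect as: echo 1 > /sys/bus/pci/rescan
        (pci / "rescan").write_text("1")
        sleep(delay)  # give the kernel time to enumerate
    return device.exists()
```

A boot script could call `rescan_until_present("0000:01:00.0")` and fall back to a full power cycle only when it returns `False`.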
        
       | undertaken wrote:
       | anecdotal/weird computer experience:
       | 
       | I have a rebadged Tongfang laptop (NB02 GMxRGxx w/ Ryzen 9) and
       | upgraded it shortly after purchase.
       | 
       | The machine arrived with lower capacity Samsung SODIMMs. Swapped
       | in 64GB of Crucial DDR5.
       | 
       | Shortly afterwards the machine became unstable to the point of
       | RMA. Kernel logs clogged with all sorts of panics related to
       | NVMe, PCIe, and filesystem. Freezing. Reboots.
       | 
       | Spent hours diagnosing it. Many permutations of kernel command
       | line arguments; pcie, acpi tables, iommu. All for naught.
       | 
       | The machine passed memtest86 / memtest86+ with flying colors.
       | 
       | bonnie++ absolutely trashed it. reliably.
       | 
       | Occasionally the NVMe drives fell off the PCI bus and it wouldn't
       | boot until I disabled the slot in the BIOS, power cycled, then
       | re-enabled the slot.
       | 
       | Fast forward to me getting fed up with a dysfunctional system, I
       | attempted an RMA and gave them the rundown of all the weird
       | seemingly chipset related failures.
       | 
       | They pushed back with "Try our RAM again."
       | 
       | I nearly had an aneurysm when everything was stable again.
       | 
       | After thanking the support staff profusely I bought larger
       | capacity Samsung DIMMs in the same chip family. Still running
       | flawlessly after almost a year.
       | 
       | Maybe try new RAM for yucks? ;)
        
       | bb88 wrote:
       | So these guys [1] mention something similar where HDMI from a TV
       | is backfeeding 40-50 volts into a cable box. This could be
       | because of many things from electrical outlet wiring to power
       | supply issues on the monitor to a bad component on the monitor
       | giving a high voltage, or the monitor is badly grounded, etc,
       | etc.
       | 
       | I read the original thread but it doesn't look like you've
       | measured the voltage at the HDMI port wrt motherboard ground. I
       | think we're assuming it's 5 volts, but it could be higher, and it
       | could have shorted (or weakened) a component on your motherboard.
       | And that would explain why a 100 ohm resistor didn't give a
       | meaningful voltage drop.
       | 
       | If you need an isolation solution, Amazon sells a 50ft fiber
       | optic one way HDMI cable [2]. The thing I don't know is if
       | there's any actual copper to provide power over the link. There
       | are other options which transmit the HDMI signal over pure
       | multimode fiber as well [3].
       | 
       | Or you can go with a DP KVM, since you're on L1T, they sell a few
       | DP models. I have one I purchased from L1T, and I like it a lot.
       | 
       | Definitely though I would check out the outlets to make sure they
       | were wired correctly. Incorrectly wired outlets because someone
       | tried to DIY it in the US is absolutely a problem.
       | 
       | [1] https://www.avsforum.com/threads/hdmi-cable-backfeeding-
       | volt...
       | 
       | [2] https://www.amazon.com/HDMI-FURUI-HDCP2-2-18Gbps-
       | Subsampling...
       | 
       | [3] https://fibercommand.com/products/8k-fiber-plugs?gQT=1
        
         | starslab wrote:
         | I already own one of those fiber-hdmi cables. Brilliant, but
         | sometimes doesn't interoperate with DVI devices using passive
         | DVI -> HDMI adapters. I've no idea if it has any copper
         | conductors for HDMI power, though one end is labelled for the
         | source and one for the display, suggesting that however it's
         | designed it's not bi-directional.
         | 
         | I'd love a DisplayPort KVM, but not every device that comes
         | across my bench has a DisplayPort output, and those few that
         | have DisplayPort but no HDMI can be accommodated with one of
         | those commodity DisplayPort -> HDMI adapters. This situation is
         | actually getting worse over time, not better, as many modern
         | devices and laptops are skipping DisplayPort in favor of USB-C
         | alt-mode.
         | 
         | This issue has actually been going on through a monitor change
         | on my testbench. It has happened with a Samsung SyncMaster 204T
         | through my KVM switch, an HP ZR24w through my KVM switch, and
         | the ZR24w directly connected. I don't think this is an issue
         | with the rest of my equipment.
         | 
         | This electrical was done about 15 years ago, by a ticketed
         | electrician. One of those $5 plug testers indicates all is
         | well, and I have no reason to believe there's any issue here.
         | 
         | By almost pure coincidence, I have an MSI PRO X870-P
         | motherboard on order. I'm looking forward to seeing if this
         | same 3V3 leakage issue is present on this board too.
        
       | bdavbdav wrote:
       | I had the same on an AORUS X570. DisplayPort cables with a line
       | tied both ends (shouldn't be, but many are) would cause BIOS
       | resets, corruption and memory retraining.
        
       | blagie wrote:
       | I was an exclusive user of Crucial for memory and storage until
       | about a year ago. My general thought was that:
       | 
       | - It would give me a trusted supply chain, since the company
       | makes the silicon; and
       | 
       | - I would have a credible company standing behind it, which
       | wasn't likely to want to tarnish its reputation cutting corners.
       | 
       | The thinking was very much along the lines of "No one got fired
       | buying IBM." And I think it was pretty correct for most of the
       | past quarter-century. Historically, storage had a lot of
       | counterfeits and shenanigans, and a credible vendor was nice.
       | Price/performance for memory was adequate; there was a modest
       | premium.
       | 
       | However, post-2020, I bought a defective Crucial DIMM (and didn't
       | find out it was defective until I was past the return window).
       | The RMA experience was strange. Crucial said they could either:
       | 
       | - Replace it with an inferior part with different, slower
       | timings, which may or may not have worked in my system
       | 
       | - Give me a quickly-expiring store credit for "fair market value"
       | (never disclosing what that was, and stopped responding to emails
       | when I asked)
       | 
       | Neither of these was helpful at all.
       | 
       | Reading online, there were many similar stories, unfortunately.
       | They seem to be going the same direction as Sandisk / Western
       | Digital. I replaced it with a cheap TeamGroup DIMM which worked
       | without problems.
       | 
       | I'm not quite sure what to do about the continued
       | enshittification. There seem to be almost no credible brands left.
        
       | sciencesama wrote:
       | I have a similar issue with NVMe on a WLAN slot on the Lenovo
       | ThinkPad gen 8!
        
       ___________________________________________________________________
       (page generated 2024-12-29 23:02 UTC)