Everything I know about SSDs
Solid State Devices using NAND Flash, how they differ from
Hard Drives, and how they affect file deletion and
recovery
March 2019:
Introduction:
I started writing this rather long page for my own benefit, when I acquired an upgrade from my old 2006 Win 8 250 gb HDD PC to a Dell OptiPlex 3010 with a 128 gb SSD. I never used more than 30 or 40 gb of the system drive, and I'm not a gamer or an avid film or music collector either. Not on a PC anyway. The further I played with my new kit, the more I realised that I knew very little about NAND flash in SSDs: just how do SSDs work, how do they read, write and store data, and what sort of trickery do they employ? I can visualise an HDD, writing tiny magnetic patterns on a rotating surface, but SSDs are different, vastly different.
There are also quite a few persistent misconceptions about SSDs, and it would be nice to examine them, if not quash a few outright. Perhaps I was guilty of harbouring some misconceptions myself. However it started, this article grew into, shall we say, a mid-level
technical discussion. If all you need to know is that SSDs
are quiet, reliable, fast, and will work for years, then
there's no need to read any further. If however, you think
that knowing how to read a 3D TLC NAND flash cell is
interesting, then you have little option but to plough on.
As much of the detailed information as possible has been
sourced from corporate and private technical articles,
with quite a lot from Seagate and WD, and the wonderfully
named Flash Memory Summit. Some of the conclusions I've
made are from just trying to apply what logic I can along
with common sense. Such is the complexity of NAND flash
controllers, the variance in their methods of operation,
and the speed of their development, that trying to
comprehend let alone keep up with them is difficult to say
the least. I can't promise that what I've written is free of confusion, or even error; it's more of a guide than a bible. There'll be some repetition too. And it will soon be out of date.
I am obliged to those I have borrowed from, and will also
be obliged to those who point out any errors without any
reward apart from that of contribution. I've tried to
explain what is different with SSDs, and why it is so hard
to grasp with our ingrained HDD minds.
The first misconception might be the plural of SSD: grammatically it should be, so I'm told, SSDs, but SSD's is almost as commonplace. Here I will stick to one SSD, many SSDs.
Software and hardware:
This article was written in 2019 onwards and deals almost
exclusively with NAND flash in the form of pc or laptop
storage devices we know as SSDs. I shan't complicate
things even more by referring to the ubiquitous flash
drive or other NAND flash devices. If significant
differences exist I shall try to note them as and when
that occurs, but the default is the internal drive.
Nowhere here is there anything about flash storage in
phones, etc.
Most of the detail was produced whilst my PC was running
Windows 10 Home, with a fairly modest internal 2.5" WD
Green 120 gb SSD. This uses a Silicon Motion SM2258XT
controller and four 32 GiB SanDisk 05497 032G 15nm 3D TLC
memory chips with an inbuilt SLC cache of unknown
capacity. As this article tries to discuss the behaviour
of SSDs as a whole it shouldn't matter what host operating
or file system is used, but in my case it's Windows and
NTFS. Nothing here is specific to a particular brand or
type of SSD, it should all be generic. We're really
dealing with the principles of SSD operation.
The only additional software applications I have used are Piriform's excellent Recuva, which can list both live and deleted files and their cluster allocations, and HxD, a very usable and capable hex editor. Recuva is free from www.piriform.com, and HxD is also free from www.mh-nexus.de. I use the portable versions of both pieces of software.
All the conclusions and opinions here are entirely my own
work, and any data taken from my own pc. It would be wise
to verify, or at least agree with my reasoning, before
accepting these words as the truth. Much of this is a
simplified explanation of a very complex subject.
SSD Physical Internals:
Poking inside an SSD is something of a disappointment, a
small pc board with a few NAND flash chips and a
controller chip, lightweight and a little flimsy. As for
the software inside the controller, I can only summarise
the basic tasks. It seems commonplace that controllers are
bought in from external manufacturers, as indeed are the
memory chips. SSD controller software is proprietary, very
complex and highly guarded, but all controllers have to do
basic tasks, even if we don't quite know how. Only those
tasks can be discussed here, the very clever tweaks and
tricks will have to remain known only to the manufacturer.
I'll start with a little groundwork.
NAND Flash:
I wasn't going to delve into the internals of NAND flash,
there are enough frankly bemusing articles on Wikipedia
for all that. All you really need to know is that NAND
(NOT-AND) flash memory stores information in arrays of
cells made from floating-gate transistors. The floating
gate can either have no charge of electrons, and be in an
'empty' logical state, or be charged with electrons at
various voltage thresholds and be in a logical state which
represents a value. NAND flash is non-volatile and retains
its state even when the SSD is not powered up. Oh yes,
it's called flash because a large chunk of cells can be
erased (flashed) at a time.
But if you want to know more, go ahead. Here the term cell
and transistor refer to the same physical entity and are
used interchangeably, and I won't keep saying NAND all the
time.
Flash memory comprises multiple two-dimensional arrays of
transistors, and supports three basic operations, read,
program (write) and erase. Apart from the flash arrays,
the flash chip includes command and status registers, a
control unit, decoders, analogue circuits, buffers, and
address and data buses. A separate chip holding the SSD
controller sends read, program, or erase commands to the
flash chip. In a read operation the controller passes the
physical address to the flash chip which locates the data
and sends it back to the controller. in a program
operation the data and physical address are passed to the
chip. In an erase operation, only the physical address is
passed to the chip.
The flash chip's latches store data transferred to and from
the flash arrays, and the sense amplifiers detect bit line
voltages during read operations. The controller monitors
the command sent to the chip using the status register.
The controller also includes Error Checking and Correction
(ECC) algorithms to manage error and reliability issues in
the chip and to ensure that correct data is read or
written.
Each row of an array is connected by a Word line, and each
column by a Bit line. At the intersection of a row and
column is a Floating Gate Transistor, or cell, where the
logical data is stored. Word lines are connected to the
transistors in parallel, and bit lines in series. The ends
of the bit lines are connected to sense amplifiers.
Flash arrays are partitioned into blocks, and blocks are
divided into pages. Within a block the cells connected to
each word line constitute a page. The cells connected to
the bit lines give the number of pages in a block. Common
page sizes are 4k, 8k or 16k, with 128 to 256 pages making
a block size between 512k and 4mb. A page is the smallest
granularity of data that can be addressed by the chip
control unit.
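As a quick sanity check of that arithmetic, a couple of lines of Python (the figures are illustrative, not from any particular chip):

    PAGE_SIZE_KB = 16        # common page sizes are 4, 8 or 16 KB
    PAGES_PER_BLOCK = 256    # commonly 128 to 256 pages per block

    block_kb = PAGE_SIZE_KB * PAGES_PER_BLOCK
    print(block_kb, "KB =", block_kb // 1024, "MB")   # -> 4096 KB = 4 MB

At the other end of the ranges, 4k pages at 128 pages per block give the 512k block.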
Read or program operations involve the chip controller
selecting the relevant block using the block decoder, then
selecting a page in the block using the page decoder. The
chip controller is also responsible for activating the
correct analogue circuitry to generate the voltages needed
for program and erase operations.
Although the number of cells in each row is nominally
equivalent to the page size, the actual number of cells in
each row is higher than the stated capacity of each page.
This is because each page contains a set of spare cells as
well as data cells. The spare cells store the ECC bits for
that page as well as the physical to logical address
mapping for the page. The controller may also save
additional metadata information about the page in the
spare area. During a read operation, the entire page
(including the bits in the spare area) is transmitted to
the controller. The ECC logic in the controller checks and
corrects the read data. During a program operation the
controller transmits both the user data and the ECC bits
to the flash memory.
Upon system boot the controller scans the spare area of each page in the entire flash array to load the logical to physical address mapping into its own memory (there may be other techniques for holding mapping data in the controller). The controller holds the logical to physical address mapping in the Flash Translation Layer (FTL). The FTL also performs garbage collection to clear invalid pages following writes, and performs wear-levelling to ensure that all the flash blocks are used evenly.
Since flash does not support in-place updates, a page
needs to be erased before its contents can be programmed;
but unlike a program or a read operation which work at a
page granularity, the erase operation is performed at a
block granularity.
2D and 3D, and Layers:
In flash architecture a block of planar flash, a
two-dimensional array of cells, is rather unsurprisingly
called 2D flash. If one (or more) array is stacked on top
of each other then it's 3D flash. 3D NAND flash is built
on one chip, in stacked layers (32 in early generations, more since), and was devised to drive
costs down when planar flash reached its scaling limit: 3D
flash costs little more than 2D to produce, but multiplies
the storage capacity immensely. In both 2D and 3D the
cells in each page (the rows) are connected by Word Lines,
and the cells at each offset within a page (the columns)
are connected with a Bit Line (to put it very simply).
3D flash is not the same as layered flash, where separate
very thin chips are arranged in a stack. This is
prohibitively expensive. Most modern consumer SSDs (in the
2010s) use 3D TLC flash.
Can I see one?
The cell size on end-user flash is minute, with 15nm being
common, and ranges from 43nm down to 12nm. Actually cell
size, or cell diameter, is misleading, as the stated size
is not a measurement of any dimension of a cell but a
measure of the distance between discrete components on the
chip. The silicon layers on the chip are approximately 0.5
to 3nm thick: by comparison a hydrogen atom is 0.1nm in
diameter, and the silicon atoms used in chip manufacture
0.2nm. A nanometre (nm) is indeed exceedingly small, a
billionth of a metre, and as an analogy if one nm were the
size of a standard marble (about 13mm) then one metre
would be the size of the earth. The power of a billion is
impressive.
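The analogy is easy to check (taking the earth's diameter as roughly 12,700 km):

    nm_in_mm = 1e-6                  # one nanometre expressed in millimetres
    scale = 13 / nm_in_mm            # blow 1 nm up to a 13 mm marble: x13,000,000
    print(1.0 * scale / 1000, "km")  # one metre at the same scale -> 13,000 km

13,000 km is close enough to the earth's diameter for me.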
SLC, MLC, TLC, QLC and Beyond:
A Single-Level Cell (SLC) has one threshold of electron charge to indicate the state of one bit, one or zero. A Multi-Level Cell (MLC) holds a charge denoting the state of two bits, with four states, separated by three thresholds, representing 11, 10, 00 and 01. A Triple-Level Cell (TLC) holds the state of three bits: 111, 110, 100, 101, 001, 000, 010, and 011. The 16 states and 15 thresholds used in Quad-Level Cells (QLC) can be deduced if anyone is at all interested. (I have seen other variations of what these threshold values represent in bit terms.)
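Those sequences aren't arbitrary: each pattern differs from its neighbour by exactly one bit - a Gray code, inverted so that the empty, no-charge state is all ones - which means a small drift in charge corrupts at most one bit. That, at least, is my reading of why the patterns look like this; a few lines of Python reproduce the sequences above (a sketch, not a vendor's encoding table):

    def cell_states(bits):
        # Charge levels 0..2^bits-1 mapped to bit patterns. Level 0 (no
        # charge) is all ones; adjacent levels differ in exactly one bit.
        mask = (1 << bits) - 1
        return [format(((i >> 1) ^ i) ^ mask, f"0{bits}b") for i in range(1 << bits)]

    for name, bits in (("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4)):
        print(name, cell_states(bits))
    # MLC -> ['11', '10', '00', '01']
    # TLC -> ['111', '110', '100', '101', '001', '000', '010', '011']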
Unfortunately when the double level cell was developed it
was called a multi-level cell and given the acronym MLC,
thus forcing everyone to type out multi-level cell
laboriously when they want to refer to multiple level
cells. If only it had been called a double-level cell we
could use DLC, TLC, and QLC freely and use MLC to describe
the lot, but it's too late for that now. If only flash had
stopped at SLC, with its yes/no one/zero state, these
explanations would be far easier to write, and hopefully
far easier to grasp.
With multi-level cells physical NAND pages represent two or more logical pages. The two bits belonging to an MLC are separately mapped to two logical pages. Even numbered pages (including zero) are mapped to the least significant (RH) bit, and odd numbered pages are mapped to the most significant (LH) bit. Similarly, the three bits belonging to a TLC are separately mapped to three logical pages, and a QLC is mapped to four logical pages (the page numbering for TLC and QLC is unknown).
The number of bits a multi-level cell has to support affects the cell's performance. With SLC the controller only has to check if one threshold has been exceeded. With MLC the cell can have four values, with TLC eight, and QLC 16. Reading the correct value of the cell requires the SSD controller to use precise voltages and multiple reads to ascertain the charge in the cell. It's also apparent that if a single physical page supports multiple logical pages then that page will be read and written more frequently than an SLC page, with a consequent effect on its life expectancy. Furthermore it would seem self-evident that a TLC SSD needs only a third of the physical cells required in an SLC device, so my 120 gb TLC SSD holds only as many cells as a 40 gb SLC drive would.
High-use enterprise SSDs used to be the province of SLC, with its greater speed, endurance, reliability and read/write capabilities, but MLC and TLC are gaining acceptance for enterprise use. The end-user consumer SSD market gets the cheaper, higher capacity, but slower and more fragile multiple level cells.
Why is Nothing One?
Anyone still following this may have noticed a common
factor in both single and multi-level cells, in that an
empty cell - where the floating gate has no charge -
represents one. Unlike HDDs, where any bit pattern can be
written anywhere, a default logical state of ones is
present on an empty SSD page. This is because there is only one programming function on the cells, to move electrons onto the floating gate. NAND flash cells can only be programmed to a state of zero, there is no ability to program a one. With multi-level cells the default is still one across all pages, but a logical one can be represented even after the cell has been programmed and there are electrons present on the gate.
Ever since Fibonacci introduced the Hindu-Arabic numeral
system with its concept of zero into European mathematics
in 1202, the human mind associates zero with empty and one
with full. To be empty and represent one is rather perplexing, and appears to be mainly convention (an empty state could represent zero, but would require inverters on the data lines). Possibly the circuitry is less complex, and possibly the ability of an empty cell to conduct a charge implies that it is a one.
They're all SLC anyway:
After all this it's perhaps worth emphasising that NAND
flash, whatever its intended use, is all physically SLC.
If you could look into a TLC cell you wouldn't see 101, or
011, or whatever. There can only ever be one quantity of
electrons in a cell, no matter how that quantity is
interpreted. The SSD controller knows whether the cells
are to be treated as SLC, MLC etc and programs them
accordingly, measures the electron count, and determines
what logical value it represents. But even quad cells can
only contain one value, just as do SLC cells.
The Myths and Misconceptions:
And now we come to the myths, misconceptions and the real
reason for writing this article, what happens when an SSD
page is read, written and rewritten, and how does this
affect deleted file recovery? We have NTFS, designed specifically for HDDs well before SSDs became easily available; NAND flash with its own unique way of operating; and several billion humans with years of ingrained HDD use and expectations. And here, if I haven't already, I shall use SSD interchangeably, if incorrectly, for NAND flash.
Storage Device Controllers:
All HDDs, SSDs and flash drives have an internal
controller. It's the way that the storage device can be,
in the words of Microsoft, abstracted from the host. That
abstraction is done by logical block addressing, where
each cluster capable of being addressed on the storage
device is known to the host by an ascending number (the
LBA). The storage device controller maps that number to
the sectors or pages on the device. To the host this
mapping is constant - a cluster remains mapped to the same
LBA until the host changes it. On an HDD this relationship
is physical and fixed: in its simplest deconstruction an
HDD controller just reads and writes whatever sectors the
host asks it to. It doesn't have to think about what was
there before, it just does what it's told and writes new
data on top of the old. It does that because it can,
there's nothing preventing a new cluster being written
directly on top of the same sectors of an old one. On an
SSD it's different.
With an SSD the host still uses the LBA addressing system
with the constant reconciliation between LBA and cluster
number. It knows that the device is an SSD and has a few
tricks to accommodate this, but they will come later. The
SSD controller however has many tricks to reconcile the
host's file system, written for HDDs, with the demands of
NAND flash.
Flash Translation Layer:
The host still uses LBA addressing to address the SSD for
read and writes, as it knows no other. These commands are
intercepted by the Flash Translation Layer on the SSD
controller. The FTL maintains a map of LBAs to physical
block addresses, and passes the translated PBA to the
controller. This map is required because unlike an HDD the
LBA to PBA relationship is volatile. It's volatile because
of the way data is written to NAND flash.
An empty page, with all cells uncharged, contains by
default all ones. If a hex editor is used to look at an
SSD's empty sectors however, it will be presented with
clusters of zeroes. This is because empty pages are not
allocated to the LBA/PBA mapping table. Instead, if a read
request is issued for an empty page a default page of
zeroes is returned. This applies to both unallocated
clusters and those which are part of a file: the SSD does
not allocate a page and change all its cells from ones to
zeroes.
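A toy model of that behaviour makes it clearer (a sketch of the principle, not anyone's real FTL): the map only contains LBAs that have actually been written, and a read of anything else just fabricates a page of zeroes:

    PAGE_SIZE = 4096

    class ToyFTL:
        def __init__(self):
            self.map = {}      # LBA -> physical page address
            self.flash = {}    # physical page address -> stored bytes

        def read(self, lba):
            if lba not in self.map:      # empty/unmapped: no flash is touched,
                return bytes(PAGE_SIZE)  # a default page of zeroes is returned
            return self.flash[self.map[lba]]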
Floating Gate Transistors:
This section might be helpful before plunging into reads
and writes, and here cell and (FG) transistor become interchangeable (a cell is a transistor). For more, much
more, about floating gate MOSFET
(Metal-Oxide-Semiconductor Field-Effect Transistors) there
is always Wikipedia.
A FGMOS transistor has three terminals: gate, drain, and source. When a voltage is applied to the gate a current can flow from the source to the drain. Low voltages applied to the gate cause the current flowing from source to drain to vary proportionally to the gate voltage. At a higher voltage the proportional response stops and the gate closes regardless.
The charge in the floating gate alters the voltage
threshold of the transistor, i.e. at what point the gate
will close. When the gate voltage is above a certain
value, around 0.5 V, the gate will always close. When the
voltage is below this value, the closing of the gate is
determined by the floating gate voltage.
If the floating gate has no charge then a low voltage
applied to the gate closes the gate and allows current to
flow from source to drain. If the floating gate has a
charge then a higher voltage needs to be applied to the
gate for it to close and current to flow. The charge in
the floating gate changes how much voltage must be applied
to the gate in order for it to close and conduct.
SSD Reads:
There's nothing inherent in the design of NAND flash that
prevents reading and writing to and from individual cells.
However in line with NAND flash's design goal to be simple
and small the standard commands that NAND chips accept are
structured such that a page is the smallest addressable
unit. This eliminates space that would be needed to hold
additional instructions and cell-to-page maps.
To read a single page, and the cells within it, the page
needs to be isolated from the other pages within the
block. To do this the pages not being read are temporarily
disabled.
All cells/transistors in the same block row (a page) are
connected in parallel with a Word Line to the transistors'
gates. All transistors in the same block column (cell
offset) are chained in series with a Bit Line connecting
the drain of one to the source of the next. At the end of
each bit line is a sense amplifier. When a read takes
place a pass-through voltage is applied to all word lines
except the page being read. The pass-through voltage is
close to or higher than the highest possible threshold
voltage and forces the transistors in all pages not being
read to close whether they have a stored charge or not.
All bit lines are energised with a low current.
The word line for the page being read is given a reference
voltage, and all the bit line sense amplifiers read.
Transistors holding a high enough electron count will not
be closed by the reference voltage, and the bit line
current will not pass through the source/drain chain to
the sense amplifier. Transistors with no charge, or a
charge below the threshold, will be closed by the
reference voltage and conduct the bit line current to the
sense amplifier. Several reads at varying threshold
voltages are required to determine the logical state of a
multi-level cell.
(To add, or avoid, more confusion a floating gate
transistor can be either open or closed. An open gate does
not conduct an electrical charge, and a closed gate does.
So if the gate is open nothing can get through, and if
it's closed it can. No wonder we're confused.)
It can be seen from this that to read one page in a block
requires that all transistors in every page in the block
receive either a pass-through or one or more reference
voltages. It also appears that this will still apply even
if some or all of the other pages in the block are empty.
This becomes significant in Read Disturbance below.
Interpreting the results:
It is quite easy to grasp the concept behind reading an
SLC. Only one threshold applies to SLC flash so only one test voltage is required - the floating gate either will or will not close. If the threshold voltage closes the gate then the bit line current passes to the sense amplifier and the stored value is one. If it doesn't then it's zero.
Multi-level cells are different, and the reasoning behind
the stored value bit order becomes apparent. In a MLC the
possible user bit combinations are 11, 10, 00 and 01,
separated by three threshold values. To read the most
significant (l/h) bit only requires one read, of the
middle threshold voltage. If the gate closes then the MSB
is one, if it doesn't then the MSB is zero, no matter what
is in the least significant bit. To read the LSB (r/h) two reads are required, one at threshold one, and one at threshold three. If the read at threshold one closes the gate then the LSB is one and the second read is not required. If the gate does not close then a read at threshold three is taken. If it closes, the LSB is zero. If it doesn't, the LSB is one.
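To make the MLC sequence concrete, here it is as a sketch in Python. closes_at() stands in for "does the gate close when this reference voltage is applied", which it does when the stored charge is below the threshold (the names and the level numbering are mine, purely for illustration):

    # Charge levels 0..3 correspond to the states 11, 10, 00, 01 (0 = no charge).
    def closes_at(level, threshold):
        return level < threshold    # the gate closes if the charge is below the reference

    def read_mlc(level):
        msb = 1 if closes_at(level, 2) else 0      # one read, middle threshold
        if closes_at(level, 1):                    # LSB read one, threshold one
            lsb = 1
        else:
            lsb = 0 if closes_at(level, 3) else 1  # LSB read two, threshold three
        return f"{msb}{lsb}"

    print([read_mlc(level) for level in range(4)])  # -> ['11', '10', '00', '01']

The TLC sequence described next works the same way, just with more threshold reads.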
The bit combinations in TLC cells are 111, 110, 100, 101,
001, 000, 010, and 011, separated by seven threshold
values, and are more tricky to grasp. The MSB bit again
only requires one read of the middle threshold, as in MLC.
The central bit requires two reads, at threshold two and
six, and the LSB requires four reads, at thresholds one,
three, five and seven.
In any multi-level cell, the logical page mapped to the MSB (l/h) is effectively read as SLC, with only one read required to determine the user bit value.
SSD Writes:
The most significant aspect of NAND flash, the widest fork
in the HDD/SSD path, and the fundamental, pivotal factor
in what follows, is that data can only be written to an
empty SSD page. This is not new, nor is it in any way
unknown, but it has the greatest implications for data
security and recovery.
While SSDs can read and write to individual pages, they
cannot overwrite pages, as the voltages required to revert
a zero to a one would damage adjacent cells. All writes
and rewrites need an empty page. Unlike HDDs, where a
complete cluster is written to the disk whatever was there
previously, the act of writing an SSD page allocates an
empty page with its default of all ones, and an electrical
charge is applied to the cells that require changing to
zeroes. This is as true for multi-level cells as it is for
SLCs, as the no-charge all-ones pattern is either replaced
with a charge representing another pattern, or is left
alone. This is a once-only process.
When a write request is issued an empty page is allocated,
usually within the same block, and the data written. The
LBA/PBA map in the FTL is updated to allocate the new page
to the relevant LBA. The LBA will always remain the same
to the host: no matter which page is allocated the host
will never know. This is the same process if the user data
is being rewritten or if it is a new file allocation: the
only difference is that the rewrite will have slightly
more work to do. The old page will be flagged as invalid
and will be inaccessible to the host, but will still take
up space within its block as it cannot be reused.
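In toy form, and in the same spirit as the ToyFTL sketch earlier, the write path looks something like this (invented names throughout, not a real controller's logic):

    ftl_map = {}                    # LBA -> physical page
    invalid_pages = set()           # stale pages awaiting garbage collection
    free_pages = list(range(1024))  # pool of erased, ready-to-program pages
    flash = {}                      # physical page -> data

    def write_page(lba, data):
        page = free_pages.pop()              # writes always go to an empty page
        flash[page] = data
        if lba in ftl_map:                   # a rewrite: the old page can't be
            invalid_pages.add(ftl_map[lba])  # overwritten, just flagged invalid
        ftl_map[lba] = page                  # the host's LBA never changes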
Whilst it's easy to grasp writing to SLC pages,
multi-level cell pages are more difficult to visualise.
The controller accumulates new writes in the SSD cache
until enough logical pages to fill a physical page are
gathered, and then writes the physical page. This entails
the fewest writes to the page. If a logical page in a
multi-level page is amended it would require a new page to
be allocated and all logical pages rewritten, as the
individual values in the physical page can't be altered.
If a logical page is deleted then I surmise that the
deleted logical page is flagged as invalid, and when the
block becomes a candidate for garbage collection any valid
logical pages are consolidated before writing. In other
words a multi-level page, or at least the majority of
them, will always contain a full complement of logical
pages.
It's apparent that if NAND flash handles data writes in
this way - and it does - the SSD will eventually become
full of valid and invalid pages, and performance will
gradually slow to a crawl. Although an individual SSD page
can't be erased a block can, and this method is used to
return blocks to a writable state. To expedite this, and
to ensure that a pool of empty blocks is always available
for writes, the SSD controller uses Garbage Collection.
Garbage Collection:
Garbage Collection is enabled on the humblest up to the
highest capacity SSD: without it NAND flash would be
unusable. Garbage Collection is part of the SSD controller
and its work is unknown to the host. In its simplest form
GC takes a block holding valid and invalid pages, copies
the valid pages to a new empty block, updates the LBA
mapping tables, and consigns the old block to the invalid
block pool. There the block and its pages are reset to
empty state, and the block added to the available block
pool. Thus a pool of available blocks should always be
available for write activity. As long as there is power to
the SSD GC will do its work, it cannot be stopped. There
are various sophisticated techniques for GC routines, all
proprietary and mainly known only to the manufacturers.
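A sketch of that simplest form, assuming a Block object with valid_pages, invalid_pages, program() and erase() (all invented here for illustration):

    def garbage_collect(blocks, erased_pool, ftl_map):
        # Greedy victim selection: the block with the most invalid pages.
        victim = max(blocks, key=lambda b: len(b.invalid_pages))
        target = erased_pool.pop()
        for lba, data in victim.valid_pages.items():
            page = target.program(data)  # copy live data to a fresh block
            ftl_map[lba] = page          # remap, so the host never notices
        victim.erase()                   # whole-block erase: cells back to all ones
        erased_pool.append(victim)       # the block rejoins the available pool

Real controllers use far subtler victim-selection policies, but the shape is the same.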
When an SSD arrives new from the factory writes will
gradually fill the drive in a progressive, linear pattern
until the addressable storage space has been entirely
written. However once garbage collection begins, the
method by which the data is written - sequential vs random
- affects performance. Sequentially written data writes
whole blocks, and when the data is replaced the whole
block is marked as invalid. During garbage collection
nothing needs to be moved to another block. This is the
fastest possible garbage collection - i.e. no garbage to
collect. When data is written randomly invalid pages are
scattered throughout the SSD. When garbage collection acts
on a block containing randomly written data, more data
must be moved to a new block before the block can be
erased.
The Garbage Collection Conundrum:
Garbage Collection can either take place in the
background, when the host is idle, or the foreground, as
and when it is needed for a write. Whilst background GC
may seem to be preferable, it has drawbacks. If the host
uses a power-saving mode when idle, GC will either wait
for the device to restart with a consequent user delay for
GC to complete, or wake the device up and reduce battery
life whilst the host is 'idle'. Furthermore GC has no
knowledge of the data it is collecting. Inevitably some
data will be subject to GC and then be deleted shortly
afterwards, incurring another bout of GC and consequent
additional and unnecessary writes (write amplification,
the ratio of actual writes to data writes). Foreground GC,
seemingly the antithesis of performance, avoids the
power-saving problems, only incurs writes when they are
actually required, and with fast cache and highly
developed GC algorithms presents no noticeable performance
penalty to the user. The trend in modern GC appears to be
foreground collection, or a combination of foreground and
background collection.
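To put a number on write amplification: suppose a one-page host write lands in a full block and forces a GC pass that must first relocate, say, three valid pages. The flash then performs four page writes for one page of user data:

    host_writes = 1      # pages of user data written
    gc_relocations = 3   # valid pages GC had to move first (an assumed figure)
    wa = (host_writes + gc_relocations) / host_writes
    print(wa)            # -> 4.0, i.e. a write amplification of 4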
Given foreground garbage collection, and that most user activity is random, the inevitable conclusion is that
the SSD will spend most of its life at full capacity, if
by that we mean available blocks, even though the
allocated space appears to the host to be low.
However there is another potential problem with SSDs, and
that is to do with a historical event: the way that file
systems were designed.
File Systems - What you see isn't what you get:
Host file systems were designed in the days when HDDs
reigned supreme, simply because SSDs had yet to arrive in
an available and affordable form. The file system does not
take into account the needs of NAND flash. Files are
constantly being updated: they get allocated, moved and
deleted, and grow and shrink in size. The way the file
system handles this is incompatible with the workings of
NAND flash.
It's worth emphasising that storage devices are abstracted
from the host operating system. Whilst an array of folders
and files are displayed by Explorer in a form wholly
comprehensible to a human, it's all an illusion. What
Explorer is showing is a logical construct created
entirely from metadata held within the file system's
tables. The storage device controller knows nothing about
files or folders, or tables or operating systems: all an
HDD or SDD sees are commands to read or write specific
sectors, which it does faithfully. An SSD has one
advantage over an HDD however, it knows that some pages
hold data, and are mapped to an LBA, and some pages are
empty, hold no valid data, and are not mapped to an LBA.
Conversely an HDD does not need to know this, to an HDD
all sectors are the same.
File Deletion:
In NTFS, when a file is deleted the entry in the Master
File Table is flagged as such, and the cluster bitmap is
amended to flag the file's clusters as available for
reuse. The delete process takes place entirely within the
MFT and the cluster bitmap. This is perfectly adequate for
an HDD, as NTFS can simply reuse the MFT entry and the
clusters whenever it wishes. On an SSD the process from
NTFS's point is exactly the same, as NTFS has no other way
of deleting files. However all the SSD sees is exactly
what an HDD would see, updates to a few pages. Neither an
HDD nor an SSD knows that it's the MFT and cluster bit map
being updated, as they have no knowledge of such things.
As there is no activity on the deleted file's clusters,
the SSD's pages holding the clusters remain mapped to
their LBAs in the FTL. The SSD's FTL has no way of knowing
that these pages are no longer allocated by NTFS: to the
SSD the pages are still valid and will not be cleaned up
by garbage collection.
As these 'dead' pages are allocated to an LBA they could
be released when files are allocated or extended and the
host uses that LBA. In this case the page will be flagged
as invalid and a new page used. However it is inevitable
that eventually a significant amount of unused and
unwanted baggage which is not flagged for garbage
collection will be pointlessly maintained by the SSD
controller and be unavailable for reuse. To overcome this,
and to correlate the hosts view of allocated and
unallocated pages with the SSD's, NTFS from Windows 7
onwards acquired the TRIM command.
SSD Detection:
Although the storage device is abstracted from the File
System, to enable some of the file system's SSD tweaks it
needs to know whether the device is an HDD or SSD. There
are various ways to do this, including querying the
rotational speed of the device, which on an SSD should be
zero (or perhaps one). This seems the most widely used and
most proficient method.
TRIM:
TRIM (it isn't an acronym) is a SATA command sent by the
file system to the SSD controller to indicate that
particular pages no longer contain live data, and are
therefore candidates for garbage collection. TRIM is only
supported in Windows on NTFS volumes. It is invoked on
file deletion, partition deletion, and disk formatting.
TRIM has to be supported by the SSD and enabled in NTFS to
take effect. The command 'fsutil behavior query
disabledeletenotify' returns 0 if TRIM is enabled in the
operating system. It does not mean that the SSD supports
it (or even if an SSD is actually installed) but all
modern SSDs support a version of it.
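On my Windows 10 machine the check looks something like this (the exact output format varies between Windows versions; later builds report NTFS and ReFS on separate lines):

    C:\>fsutil behavior query disabledeletenotify
    DisableDeleteNotify = 0

A zero meaning TRIM is enabled is slightly back to front, but there it is.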
There are three different types of TRIM defined in the
SATA protocol and implemented in SSD drives.
Non-deterministic TRIM: where each read command after a
TRIM may return different data; Deterministic TRIM (DRAT):
where all read commands after a TRIM return the same data
(i.e. become determinate) and do not change until new data
is written; and Deterministic Read Zero after TRIM (DZAT):
where all read commands after a TRIM return zeroes until
the page is written with new data. By the way, whilst DRAT returns data on a read it is not the user data that was previously there before the TRIM: it is random.
Fortunately Non-Deterministic TRIM is rarely used, and
Windows does not support DRAT, so a read of a trimmed page
- which is easily done with a hex editor - invokes DZAT
and returns zeroes immediately after the TRIM command is
issued. The physical pages may not have been cleaned
immediately following the TRIM command, but the SSD
controller knows that there is no valid data held at the
trimmed page address.
TRIM tells the FTL that the pages allocated to specific
LBAs are to be classed as invalid. When a block no longer
has any free pages, or a specific threshold is reached,
the block is a candidate for garbage collection. Live data
is copied to a new empty block, and the original block is
erased and made available for reuse.
TRIM is an asynchronous command that is queued for
low-priority operation. It does not need or send a
response. The size of the TRIM queue is limited and in
times of high activity some TRIM commands may be dropped.
There is no indication that this takes place, so some
unwanted pages may escape garbage collection.
RETRIM:
Windows Defragger - now called Storage Optimiser - has an
option to Optimise SSDs. This does not defrag the SSD but
sends a series of TRIM commands to all unallocated pages
identified in NTFS's cluster bitmap. This global TRIM (or RETRIM) command is run at a granularity such that the TRIM queue never exceeds its permitted size and no RETRIM commands are dropped. A RETRIM is run automatically once a month by the Storage Optimiser.
Over-provisioning:
All NAND flash devices use over-provisioning, additional
capacity for extra write operations, controller firmware,
failed block replacements, and other features utilised by
the SSD controller. This capacity is not physically
separate from the user capacity but is simply an amount of
space in excess of that which can be allocated by the
host. The specific pages within this excess space will
vary dynamically as the SSD is used. According to Seagate,
the minimum reserve is the difference between binary and
decimal naming conventions. An SSD is marketed as a
storage device and its capacity is measured in gigabytes
(1,000,000,000 Bytes). NAND flash however is memory and is
measured in gibibytes (1,073,741,824 bytes), making the
minimum overprovisioning percentage just over 7.37%. Even
if an SSD appears to the host to be full, it will still
have 7.37% of available space with which to keep
functioning and performing writes (although write
performance will be diabolical). Manufacturers may further
reduce the amount of capacity available to the user and
set it aside as additional over-provisioning, in addition
to the built-in 7.37%. Additional over-provisioning can
also be created by the host by allocating a partition that
does not use the drive's full capacity. The unallocated
space will automatically be used by the controller as
dynamic over-provisioning.
My humble WD SSD has four 32 GiB chips but a specified capacity of 120 gb, meaning that it has 8 gb set aside as additional over-provisioning. Add this to the 7.37% minimum (9.4 gb) and the 17.4 gb equates to almost 15% over-provisioning space.
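The arithmetic, done properly in Python with my drive's figures:

    GB  = 1_000_000_000
    GiB = 1_073_741_824

    flash    = 4 * 32 * GiB  # four 32 GiB chips: about 137.4 gb of raw NAND
    marketed = 120 * GB      # what the host is allowed to allocate

    print(f"{GiB / GB - 1:.2%}")                   # minimum OP: 7.37%
    print(f"{(flash - marketed) / marketed:.2%}")  # my drive's total OP: ~14.5%

Which is where the "almost 15%" comes from.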
Wear Levelling:
Some files are written once and remain untouched for the
rest of their life. Others have few updates, some very
many. As a consequence some blocks will hardly ever see
the invalid block pool and have a very low erase/write
count, and some will be in the pool every few minutes and
have a very heavy count. To spread the wear so that all
blocks are subject to erase/writes equally, and the
performance of the SSD is maintained over its life, wear
levelling is used. Wear levelling uses algorithms to
identify blocks with the lowest erase count and move the
contents to high erase count blocks; and to select low
erase count blocks for new allocations. As with garbage
collection, wear levelling is far more complex than I
could possibly deduce, let alone explain.
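In its most naive form the selection is nothing more than this (a sketch; real algorithms juggle hot and cold data, and much else besides):

    def pick_block_for_write(free_blocks):
        # Dynamic wear levelling: always allocate the least-worn free block.
        return min(free_blocks, key=lambda block: block.erase_count)

    def pick_block_to_relocate(all_blocks):
        # Static wear levelling: find the least-worn block, so its long-lived
        # 'cold' data can be moved and the block returned to the working pool.
        return min(all_blocks, key=lambda block: block.erase_count)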
Read Disturbance:
SSD reads are not quite free, there is a price to pay. As
described above, a read of one page generates a
pass-through voltage on all other cells in the block. This
voltage is well below a full programming voltage, but it still generates a
weak programming effect on the cells, which can
unintentionally shift their threshold voltages. The
pass-through voltage induces electric tunnelling that can
shift the voltages of the unread cells to a higher value,
disturbing the cell contents. As the size of flash cells
is reduced the transistor oxide becomes thinner and in
turn increases this tunnelling effect, with fewer read
operations required to neighbouring pages for the unread
flash cells to become disturbed, and move into a different
logical state. Cells holding lower threshold values are
more susceptible to read disturbance.
Thus each read can cause the threshold voltages of other
unread cells in the same block to shift to a higher value.
After a significant number of reads this can cause read errors for those cells. A read count is kept for each block and if it is exceeded the block is rewritten. The count is high for SLC cells, around 1 million, lower for 25 nm MLC at around 40,000, and much lower for 15 nm TLC cells.
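The defence is simple to picture: count reads per block, and rewrite the block's contents elsewhere before errors accumulate. A sketch (the TLC limit is my placeholder, the others are the ballpark figures above):

    READ_LIMITS = {"SLC": 1_000_000, "MLC_25nm": 40_000, "TLC_15nm": 10_000}

    def note_read(block, limit):
        block.read_count += 1
        if block.read_count >= limit:  # before disturb errors pile up,
            block.rewrite_elsewhere()  # refresh the data in a fresh block
            block.read_count = 0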
File Recovery:
And now we come to deleted file recovery. NTFS goes
through exactly the same process to delete a file on an
SSD as it does on an HDD, with the exception of the
additional TRIM command. And the TRIM command (assuming
it's executed) and a few SSD quirks destroy any
practicable chance of deleted file recovery.
TRIM commands, as described above, have a complementary
setting within the SSD controller in the form of DRAT and
DZAT. (I don't believe that non-deterministic TRIM is used
in any reputable SSD, and I don't think that Windows
supports DRAT, but I have no proof.) The implementation of
DZAT means that immediately on successful execution of the
TRIM command (which will in most cases be immediately on
file deletion) any attempt to read the TRIMed page will
return zeroes. The data on the page will still exist until
the block is processed by the garbage collector, but that
data is not accessible from the host by any practicable
means, or any general software.
Garbage collection is independent of the host device and
will be invoked at the will of the SSD's controller. Once
the process is started it cannot be stopped, apart from
powering off the SSD. Once powered up again the garbage
collector will resume its duties to completion.
Deleted file recovery on a modern SSD is next to
impossible for the end user, and under Windows as close to
impossible as you can get. A theoretical examination of
the chips would most likely show compressed and encrypted
data, striped over multiple blocks, and no possibility of
relating one page of data to another across the multiple
millions of pages. There is a very small possibility of
recovering recently deleted files by powering off the SSD
immediately and sending it to a professional data recovery
company. They may recover some data, given enough time and
money.
After a session of file deletion, such as running
Piriform's CCleaner, run Recuva on the SSD. The headers of
the deleted files found (and presumably the rest of the
file) will all be zeroes. This is TRIM and DZAT doing
their work in a few seconds, killing any chance of deleted
file recovery. Of course TRIM can be disabled, at the cost
of performance, but it's probably better to be a little
less cavalier when deleting files that might be wanted
later.
Deleted File Security:
The notion of secure file deletion - overwriting a file's data before deletion - is irrelevant on an SSD, and if any pattern other than zeroes is chosen it just adds pointless wear. Even overwriting with zeroes will cause transaction log and other files to be written, so secure file deletion on an SSD should never be used.
Wiping Free Space is far worse for pointless writes, and
is even more futile than secure file deletion. The deleted
files just aren't there any more.
The OCZ Myth:
Some years ago (as a little light relief to all these
acres of text) the OCZ forums were buzzing with the latest
method of regaining performance on their SSDs: run
Piriform's CCleaner Wipe Free Space, with one overwrite
pass of zeroes. Although performance may have been
regained, logic, and common sense, went out of the window.
The theory was that overwriting the pages with zeroes was
equivalent to erasing blocks (this was before the days of
TRIM). This was nonsense, and should have been apparent
from the start. The default state of an empty page is all
ones, not zeroes, and how could a piece of software
possibly erase NAND flash? The real reason was that as
CCleaner was filling the pages with zeroes the SSD
controller simply unmapped the pages and showed default
pages of zeroes to the host. The invalid pages were then
candidates for garbage collection, which gave a much
greater pool of blocks to call upon on writes, and hence a
better performance. A sort of RETRIM before that was
invented.
SSD Defragmentation:
One of the SSD mantras is that an SSD should never be
defragged. Whilst there is little (there is a little) to
be gained from rearranging clusters into adjacent pages -
an SSD has no significant overhead in random reads - an
SSD defrag is not entirely verboten. In fact from Windows
8 onwards the Storage Optimiser will defrag an SSD if
certain conditions are met. If System Restore is enabled,
the fragmentation level is above 10%, and at least one
month has passed since the last defrag, Windows Storage
Optimiser Scheduled Maintenance will defrag the SSD. This
is what Microsoft calls a Traditional Defrag, it is not an
Optimise (RETRIM). The defrag is required to reduce the
extents on the volume snapshot files when system restore
is enabled.
There is nothing to be afraid of in a monthly defrag. Most
users won't hit the 10% fragmentation criterion so a simple RETRIM will be run, and Windows 10 users won't get defragged anyway (System Restore is disabled in Windows 10 by default). The reduction in life of an SSD will not be
noticed. Furthermore, although SSDs are not fazed by
random reads, files do get fragmented and that means a
significant increase in I/Os. An occasional clearup is a
boon.
SSD Lifetime:
There are many users worried about the life expectancy of their SSDs. Yes, continuous write/erase cycles, and the added and unseen write amplification, do take a toll on the life of NAND flash. Using an SSD does wear it out. My WD Green 120 gb SSD, a TLC SSD from a reputable manufacturer but at the very lowest cost, has an estimated life of 1 million+ hours and a write limit of 40 terabytes. One million hours is 114 years, so we can forget that. As for writes, at 1 gb a day - far more than my current rate of data use - it would take around 110 years to reach 40 tb. Even with massive write overhead this SSD is not going to wear out in the foreseeable future. If all 128 GiB of available flash is used equally, the 40 tb equates to roughly 312 writes per cell, a very conservative number.
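The back-of-envelope sums, for anyone who wants to check them:

    print(1_000_000 / (24 * 365))  # one million hours     -> ~114 years
    print(40e12 / 1e9 / 365)       # 40 tb at 1 gb a day   -> ~110 years
    print(40e12 / 128e9)           # 40 tb over 128 gb     -> ~312 writes per cell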
The End:
The only thing to add is that NAND flash, SSDs, and
especially SSD controllers, are far more sophisticated,
complex and incomprehensible than what has been written
here, what I know, what I could possibly comprehend, and
what I could possibly explain. I should also add secret,
as their software is proprietary. Whilst an HDD is a
marvel of complex electro-mechanical engineering at a
ridiculously low cost, the SSD is an equally marvellous
and complex piece of electronics and software at a
minimally higher cost. We should be thankful for both.
If you have any questions, comments or criticisms at all
then I'd be pleased to hear them: please email me at kes
at kcall dot co dot uk.