https://github.com/openzfs/zfs/pull/13392

openzfs/zfs — Block Cloning #13392
Merged: behlendorf merged 1 commit into openzfs:master from pjd:brt on Mar 10, 2023.
+3,480 −120 | Conversation 172 | Commits 1 | Checks 8 | Files changed 51

@pjd (Contributor, author) commented Apr 30, 2022 (edited):

Motivation and Context

Block Cloning allows cloning a file (or a subset of its blocks) into another (or the same) file by creating additional references to the data blocks, without copying the data itself. Block Cloning can be described as fast, manual deduplication.

Description

In many ways Block Cloning is similar to the existing deduplication, but there are some important differences:

* Deduplication is automatic; Block Cloning is not - one has to use dedicated system call(s) to clone a given file or set of blocks.
* Deduplication keeps all data blocks in its table, even those referenced just once. Block Cloning creates an entry in its tables only when there are at least two references to a given data block. If a block was never explicitly cloned, or the second-to-last reference to it was dropped, there is neither space nor performance overhead.
* Deduplication needs data to work - one has to pass real data to the write(2) syscall so that a hash can be calculated.
Block Cloning doesn't require data, just block pointers to the data, so it is extremely fast: we pay neither the cost of reading nor the cost of writing the data - we operate exclusively on metadata.
* If the D (dedup) bit is not set in a block pointer, the block is not in the dedup table (DDT) and we won't consult the DDT when we need to free the block. Block Cloning must be consulted on every free, because we cannot modify the source BP (e.g. by setting something similar to the D bit); we therefore have no hint whether the block is in the Block Reference Table (BRT), so we need to look into the BRT. An optimization that eliminates the majority of BRT lookups is described below in the "Minimizing free penalty" section.
* A BRT entry is much smaller than a DDT entry - for the BRT we only store a 64-bit offset and a 64-bit reference counter.
* Dedup keys are cryptographic hashes, so two blocks that are close to each other on disk are most likely in totally different parts of the DDT. BRT entry keys are offsets into a single top-level VDEV, so data blocks from one file should have BRT entries close to each other.
* Scrub does only a single pass over a block that is referenced multiple times in the DDT. Unfortunately this is not currently (if at all) possible with Block Cloning: a block referenced multiple times will be scrubbed multiple times.
* Deduplication requires a cryptographically strong hash as the checksum, or additional data verification. Block Cloning works with any checksum algorithm, or even with checksumming disabled.

As mentioned above, BRT entries are much smaller than DDT entries. To uniquely identify a block we just need its vdev id and offset. We also need to maintain a reference counter. The vdev id will often repeat, as there is a small number of top-level VDEVs and a large number of blocks stored in each of them. We take advantage of that to reduce the BRT entry size further, by maintaining one BRT for each top-level VDEV; a BRT entry can then consist of just the offset and the counter.

Minimizing free penalty.

Block Cloning allows cloning any existing block. When we free a block, there is no hint in the block pointer whether the block was cloned or not, so on each free we have to check whether there is a corresponding entry in the BRT; if there is, we need to decrease its reference counter. Doing a BRT lookup on every free can be expensive, requiring additional I/O if the BRT doesn't fit into memory. This is the main problem with deduplication, so we've learned our lesson and try not to repeat the same mistake here. How do we do that? We divide each top-level VDEV into 64MB regions. For each region we maintain a reference counter that is the sum of the reference counters of all cloned blocks whose offsets fall within the region. This gives us an array of 64-bit numbers for each top-level VDEV - the regions array. The regions array is always kept in memory and is updated on disk in the same transaction group as the BRT updates, to keep everything in sync. We can keep the array in memory because it is very small: with 64MB regions and a 1TB VDEV, the array requires only 128kB of memory (16,384 regions of 8 bytes each; we may decide to decrease the region size in the future). Now, when we want to free a block, we first consult the array. If the counter for the whole region is zero, there is no need to look for a BRT entry, as there certainly isn't one. Only if the counter for the region is greater than zero do we do a BRT lookup, and if an entry is found, we decrease the reference counters in the entry and in the regions array.
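A minimal sketch of that fast-path check (names such as brt_vdev_t and bv_refcount are illustrative, not necessarily the PR's):

```c
#include <stdbool.h>
#include <stdint.h>

#define	BRT_RANGESIZE	(64ULL << 20)	/* 64MB regions */

typedef struct brt_vdev {
	uint64_t	*bv_refcount;	/* per-region sums of clone refcounts */
	uint64_t	bv_nregions;
} brt_vdev_t;

/* Cheap in-memory test: can this offset possibly have a BRT entry? */
static bool
brt_maybe_cloned(const brt_vdev_t *bv, uint64_t offset)
{
	uint64_t region = offset / BRT_RANGESIZE;

	if (region >= bv->bv_nregions)
		return (false);
	/* Zero means no cloned block anywhere in this 64MB region. */
	return (bv->bv_refcount[region] != 0);
}
```

Only when this returns true does the free path pay for an actual BRT lookup.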
The regions array is small, but can potentially get larger for very large VDEVs or smaller regions. In that case we don't want to rewrite the entire array on every change, so we divide the regions array into 128kB chunks and keep a bitmap of dirty chunks within a transaction group. When we sync the transaction group, we update only the parts of the regions array that were modified. Note: keeping track of the dirty parts of the regions array is implemented, but updating only parts of the regions array on disk is not yet - for now we update the entire regions array if there was any change.

The implementation tries to be economical: if the BRT is not used, or no longer used, there are no entries in the MOS and no additional memory is used (e.g. the regions array is only allocated if needed).

Interaction between Deduplication and Block Cloning.

If both functionalities are in use, we could end up with a block that is referenced multiple times in both the DDT and the BRT. When we free one of the references, we couldn't tell where it belongs, so we would have to decide which table takes precedence: do we first clear DDT references or BRT references? To avoid this dilemma, the BRT cooperates with the DDT: if a given block is being cloned using the BRT and the BP has the D (dedup) bit set, the BRT looks up the DDT entry and increases the counter there. No BRT entry is created for a block that resides on a dataset with deduplication turned on. The BRT may be more efficient for manual deduplication, but if the block is already in the DDT, creating an additional BRT entry would be less efficient. This clever idea was proposed by Allan Jude.

Block Cloning across datasets.

Block Cloning is not limited to cloning blocks within the same dataset. It is possible (and very useful) to clone blocks between different datasets. One use case is recovering files from snapshots: by cloning the files into the dataset, we need no additional storage. Without Block Cloning we would need additional space for those files. Another interesting use case is moving files between datasets (copying the file content to the new dataset and removing the source file). In that case Block Cloning is only used briefly, because the BRT entries are removed when the source is removed.

Note: currently it is not possible to clone blocks between encrypted datasets, even if those datasets use the same encryption key (this includes snapshots of encrypted datasets). Cloning blocks between datasets that use the same keys should be possible and should be implemented in the future.

Block Cloning flow through ZFS layers.

Note: Block Cloning can be used both for cloning file system blocks and ZVOL blocks. As of this writing, no interface is implemented that allows cloning ZVOL blocks.

Depending on the operating system, there may be different interfaces for cloning blocks. On FreeBSD we have two syscalls:

```c
ssize_t fclonefile(int srcfd, int dstfd);
ssize_t fclonerange(int srcfd, off_t srcoffset, size_t length,
    int dstfd, off_t dstoffset);
```

Even though fclonerange() takes byte offsets and a length, they have to be block-aligned.
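A sketch of how the proposed fclonerange() syscall might be used from userland (hypothetical: the prototype is declared by hand here because these syscalls exist only in this PR, not in any released libc):

```c
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>

/* Proposed in this PR; not provided by any released libc. */
ssize_t fclonerange(int srcfd, off_t srcoffset, size_t length,
    int dstfd, off_t dstoffset);

int
main(void)
{
	int src = open("/tank/golden.img", O_RDONLY);
	int dst = open("/tank/vm1.img", O_WRONLY | O_CREAT, 0644);

	if (src < 0 || dst < 0) {
		perror("open");
		return (1);
	}
	/* Offsets and length must be block-aligned; clone 1MB at 0. */
	if (fclonerange(src, 0, 1024 * 1024, dst, 0) < 0) {
		perror("fclonerange");
		return (1);
	}
	return (0);
}
```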
Both syscalls call the OS-independent zfs_clone_range() function. This function was modeled on zfs_write(), but instead of writing the given data, we first read the block pointers from the source file, using the new dmu_read_l0_bps() function. Once we have the BPs from the source file, we call the dmu_brt_addref() function on the destination file. This function allocates BPs for us. We iterate over all source BPs. If a given BP is a hole or an embedded block, we just copy the BP. If it points to real data, we place the BP on a BRT pending list using the brt_pending_add() function. We use this pending list to keep track of all BPs that received new references within this transaction group.

Some special cases to consider, and how we address them (a sketch of the retry used in the first two cases follows this list):

* The block we want to clone may have been created within the same transaction group in which we are trying to clone it. Such a block has no BP allocated yet, so it is too early to clone it. In this case the dmu_read_l0_bps() function returns EAGAIN, and in zfs_clone_range() we wait for the transaction group to be synced to disk and retry.
* The block we want to clone may have been modified within the same transaction group. We could potentially clone the previous version of the data, but that doesn't seem right. We handle it as the previous case.
* A block may be cloned multiple times during one transaction group (that's why the pending list is actually a tree and not an append-only list - this way we can determine faster whether this block is being cloned for the first time in this txg or a subsequent time).
* A block may be cloned and freed within the same transaction group (see dbuf_undirty()).
* A block may be cloned and, within the same transaction group, the clone may be cloned again (see dmu_read_l0_bps()).
* A file might have been deleted while the caller still has an open file descriptor to it and clones it.
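A sketch (not the PR's code; arguments simplified and variables assumed in scope) of the wait-and-retry described in the first two bullets:

```c
/*
 * Keep asking for the source block pointers until they exist on disk.
 * dmu_read_l0_bps() reports EAGAIN while the source blocks are still
 * dirty in the open transaction group.
 */
for (;;) {
	error = dmu_read_l0_bps(inos, inobj, inoff, size, bps, &nbps);
	if (error != EAGAIN)
		break;
	/* Wait for the open txg to reach disk, then try again. */
	txg_wait_synced(dmu_objset_pool(inos), 0);
}
```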
When we free a block, there is an additional step in the ZIO pipeline where we call the zio_brt_free() function. It calls brt_entry_decref(), which loads the corresponding BRT entry (if one exists) and decreases the reference counter. If this is not the last reference, we stop the ZIO pipeline there. If this is the last reference, or the block is not in the BRT at all, we continue the pipeline and free the block as usual.

At the beginning of spa_sync(), when no more block cloning can happen but before frees are issued, we call brt_pending_apply(). This function applies all the new clones to the BRT table: we load BRT entries and update their reference counters. To sync new BRT entries to disk we use the brt_sync() function, which syncs all dirty per-top-level-VDEV BRTs, regions arrays, etc.

Block Cloning and ZIL.

Every clone operation is divided into chunks (similar to writes), and each chunk is cloned in a separate transaction. To keep ZIL entries small, each chunk clones at most 254 blocks, which keeps each ZIL entry at about 32kB (254 block pointers of 128 bytes each). Replaying a clone operation is different from a regular clone operation: when we log a clone operation, we cannot use the source object - it may reside on a different dataset - so we log the BPs we want to clone. The ZIL is replayed when the given dataset is mounted, not when the pool is imported. It is therefore possible for the pool to be imported without mounting its datasets, and for the source dataset to be destroyed before the destination dataset is mounted and its ZIL replayed. To address this we leverage the zil_claim() mechanism, in which ZFS parses all the ZILs on pool import. When we come across TX_CLONE_RANGE entries, we bump the reference counters for their BPs in the BRT; later, on mount and ZIL replay, we just attach the BPs to the file without bumping the reference counters again.

Note that it is still possible that after zil_claim() we never mount the destination, so we never replay its ZIL and then destroy it. That would leave leaked references in the BRT. We address that too, as ZFS gives us a chance to clean up on dataset destroy (see zil_free_clone_range()).

How Has This Been Tested?

I have a test program that exercises this functionality, which I have been using for manual testing.

Types of changes

* [ ] Bug fix (non-breaking change which fixes an issue)
* [x] New feature (non-breaking change which adds functionality)
* [ ] Performance enhancement (non-breaking change which improves efficiency)
* [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
* [ ] Breaking change (fix or feature that would cause existing functionality to change)
* [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
* [x] Documentation (a change to man pages or other documentation)

Checklist:

* [x] My code follows the OpenZFS code style requirements.
* [ ] I have updated the documentation accordingly.
* [x] I have read the contributing document.
* [ ] I have added tests to cover my changes.
* [ ] I have run the ZFS Test Suite with this change applied.
* [x] All commit messages are properly formatted and contain Signed-off-by.

@lin72h commented May 3, 2022:

It's finally happening. Thanks for the hard work!

@rincebrain (Contributor) commented May 3, 2022:

First, thanks for this; it's been something I've looked forward to since I heard about it on the drawing board. A couple of thoughts, before I go play with this on my testbeds and have more informed ones:

* I think BRT winds up being strictly better than dedup for almost every use case (except when people don't have the time or initial additional space to burn to examine their data to be BRT'd after the fact). Consequently, one thing I've thought would be nice since I heard about the proposal is if we could eventually implement a transparent "migrate to BRT" from a deduplicated pool - but I believe deferring to the DDT whenever the D bit is set would break the ability to do that later without another feature flag change. Does that seem like a worthwhile idea/reason to consider ways to allow it to work without breaking changes? (I also would just really like to see existing dedup use dwindle to zero after BRT lands, just so my biases are clear.)
* In a sharply contrasting but complementary thought, another unfortunate loss when one uses BRT is that you pay full size for your data on send/recv until you perhaps do an additional pass on it afterward... but that doesn't help your snapshots, which may be a substantial penalty.
So another thing I've been curious about is your thoughts on the feasibility of additional hinting for zfs send streams - hints that zfs send on its own would not generate, saying "this received record should be a clone of this block" - with the intent that one could implement, say, a zstream pipeline that is handed a periodically-updated-by-userland dedup table and checks incoming blocks against it. Is this a userland dedup implementation, similar to the one that just got ejected from the codebase? Yes. Could you mangle your bytes if it incorrectly claimed a block was a duplicate? Also yes. Would allowing people to pay the computational cost to optionally BRT their incoming send streams, rather than live with the overhead forever, be worth it, IMO? A third yes. (Not suggesting you write it, for this or in general; just curious about your thoughts on the feasibility of such a thing, since "all your savings go up in smoke on send/recv" seems like one of the few pain points with no mitigations at the moment... though I suppose you could turn on dedup, receive, and then use idea 1, if that's feasible. Heh.)
* You said scrub will currently iterate over BRT'd blocks multiple times. IANA expert on either BRT or the scrub machinery, but it seems like you could get away with a BRT-lite table where, any time a block is about to be scrubbed, you check the BRT-lite table; if it's not there, you check the BRT, and if that has an entry, you queue the block and add an entry to the aforementioned lite table. (You could even just drop it all on the floor if the machine restarts, rather than persisting it periodically, and accept scrubbing some blocks more than once after a restart.) Come to think of it, if that works, you could do the same for dedup entries and stop having to scan the DDT for them, which might make someone I know happy. (You might break backward compatibility of resuming scrubs, though, I suppose. Blech.) I'm likely missing some complexity that makes this infeasible; just curious why you think this might not be a thing one could implement.

Thanks again for all your work on this, and for reading my uninformed questions. :)

@behlendorf added the "Status: Code Review Needed" label, May 3, 2022.
@rincebrain mentioned this pull request May 3, 2022: Feature request: moving datasets #13393 (Open)

@pjd (Contributor, author) commented May 4, 2022 (edited):

@rincebrain Here are my thoughts on easier manual deduplication and on preserving deduplication during send/recv.

Let's say you would like to manually deduplicate the data on your pool periodically. To do that, you would need to scan the entire pool, read all the data, and build a database you can check against to see whether another copy of a given block already exists. Once you have such a database, you could use zfs diff to get only the files that have changed and scan only those. Even better, if we had a mechanism to read all the BPs of a given file without reading the data, we could determine which blocks are newer than the snapshot and read only the data of those blocks (and not all the data within a modified file). One could teach zfs recv to talk to this database and deduplicate blocks on the fly. Note that this database is pretty much a DDT in userland.

Converting the DDT to BRT is still doable.
I think we would just need to teach ZFS that even though a BP has the D bit set, the block may not be in the DDT.

As for scrub: of course, everything is doable. :) Before you start the scrub, you could allocate a bitmap where one bit represents 4kB (2^ashift) of storage. Once you scrub a block, you set the corresponding bits, so the next time you are about to scrub the same block, you know it has already been done. Such a bitmap would require 32MB of RAM per 1TB of pool storage (1TB of 4kB blocks is 2^28 blocks, and 2^28 bits is 32MB). With this in place we could get rid of the current mechanism for the DDT. The bitmap could be stored in the pool to allow scrub to continue after a restart (it could be synced to disk rarely, as the worst that can happen is that we scrub some blocks twice).

@jittygitty commented May 17, 2022 (edited):

@pjd Wow, I just noticed this (do you work in stealth mode? :)) and am surprised at the progress - thanks! I had a question related to what @rincebrain asked about zfs send/receive: do you know how btrfs does its reflink preservation in send streams? (I think via the -c and -p options? But I've also read on mailing lists that btrfs can supposedly preserve reflinks over send/receive.) https://blogs.oracle.com/linux/post/btrfs-sendreceive-helps-to-move-and-backup-your-data So I was wondering how they do it. Thanks.

Also: how can we donate anonymously (to a specific dev, or to OpenZFS earmarked for a specific developer)? And will donations be tax-exempt for the recipient or not? If not, the sender should deduct them as a business expense. (Personally, I want my unsolicited donations to be accepted as tax-exempt gifts by recipients, but double taxation of the same money needs to be avoided, so that donor and gift recipient don't both pay taxes on donated funds; that's the reason for the question.)

@amotin (Member) reviewed May 20, 2022:

Thank you, Pawel. It is a cool feature. I've only started reading the code, so only a few first comments so far.

On module/os/freebsd/zfs/zfs_vnops_os.c:

```c
	td->td_retval[0] = done;
	return (done > 0 ? 0 : error);
}
```

@amotin (May 20): As I see it, FreeBSD already has sys_copy_file_range()/kern_copy_file_range() and even linux_copy_file_range(). Aren't you duplicating?

@pjd (May 20, edited): To be honest, I was under the impression that copy_file_range(2) always copied the data, just avoiding copying it to and from userland, but Linux's manual page does mention that reflinks can be used. So, yes, that is a good candidate to use.
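For context, a minimal userland sketch of the copy_file_range(2) loop (following the pattern shown in the Linux man page); whether the kernel clones or copies underneath is up to the filesystem:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return (1);
	}

	int in = open(argv[1], O_RDONLY);
	int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	struct stat st;

	if (in < 0 || out < 0 || fstat(in, &st) < 0) {
		perror("setup");
		return (1);
	}
	/* The kernel decides whether this reflinks or copies. */
	for (off_t left = st.st_size; left > 0; ) {
		ssize_t n = copy_file_range(in, NULL, out, NULL, left, 0);
		if (n <= 0) {
			perror("copy_file_range");
			return (1);
		}
		left -= n;
	}
	return (0);
}
```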
I'd still prefer to have a flag controlling this behavior, but on Linux, I guess, reflink is used when available. One problem is that if the offsets and length are not recordsize-aligned, we would still need to implement an actual copy of the non-aligned parts. All in all, please ignore the syscalls for now; they were added so this can be tested. That will be done separately, maybe under sponsorship of the FreeBSD Foundation.

On module/zcommon/zpool_prop.c:

```c
zprop_register_number(ZPOOL_PROP_BRTRATIO, "brtratio", 0,
    PROP_READONLY, ZFS_TYPE_POOL, "<1.00x or higher if cloned>",
    "BRT", B_FALSE, sfeatures);
```

@amotin (May 20): "brtratio" is not very informative. Why not "cloneratio", for example?

@pjd (May 20): I agree this is not the best name... I didn't want to use "clone", as it may be confused with 'zfs clone', and "blockcloneratio" seems too long. But I agree this should be changed. If people are OK with "cloneratio", it's fine by me. Thank you, Alex, for looking at this!

@amotin (May 26): I am actually curious what it measures and how useful it is. It seems to be the average number of references to each block that has more than one. For dedup the attribute made more sense, since it counted all blocks in the DDT, which included all blocks written since dedup was enabled.

@allanjude (Contributor, May 26): "blockrefratio" is better, but maybe a bit long.

@pjd (Contributor, author) commented May 20, 2022:

> [quoting @jittygitty's question about btrfs reflink preservation over send/receive]

Sorry, but I don't know how they do it.
> [quoting @jittygitty's questions about anonymous, tax-exempt donations]

I'd recommend donating to the FreeBSD Foundation, which is fully tax-deductible in the US, and the Foundation is supporting OpenZFS development.

@jittygitty commented May 21, 2022:

@pjd I wish FreeBSD all the best, but I wanted to donate to a specific person, and I only use Linux anyway. I would not mind paying taxes on the donation, without any tax deduction, so that the recipient does not, since it is a gift. Also, via the foundation it seems you must give your full name, address, etc.; the anonymity they offer is just not listing your name on their website. I'll do a bit of research to see whether I can find a more direct way than the FreeBSD Foundation, or whether that's the only way.

As for btrfs reflinks over send/receive, I think they 'replay' a sort of playbook of actions on extents, which is played back on receive. (I could be wrong, since the details escape me now; I'll have to take a peek again.) Many thanks again for all your hard work and for surprising us with this pull request for a feature many of us have long wished for!

@amotin (Member) reviewed May 21, 2022:

A few more thoughts, in order of appearance. Still reading.

On module/zfs/brt.c:

```c
spa_config_enter(brt->brt_spa, SCL_VDEV, FTAG, RW_READER);
vd = vdev_lookup_top(brt->brt_spa, brtvd->bv_vdevid);
size = vdev_get_min_asize(vd) / brt->brt_rangesize + 1;
```

@amotin (May 20): Aren't you allocating one extra region here when vdev_get_min_asize(vd) is a multiple of brt->brt_rangesize?

@pjd (May 24, edited): I am. Thanks.

@amotin (May 24): I was thinking of DIV_ROUND_UP(), AKA howmany().
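That is, roughly (howmany() from sys/param.h computes (a + b - 1) / b):

```c
/* Rounds up, without the extra region when the size divides evenly. */
size = howmany(vdev_get_min_asize(vd), brt->brt_rangesize);
```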
On module/zfs/brt.c:

```c
brt_vdev_sync(brt, brtvd, tx);

if (brtvd->bv_totalcount == 0)
	brt_vdev_destroy(brt, brtvd, tx);
```

@amotin (May 20): Do you need brt_vdev_sync() in the brt_vdev_destroy() case?

@pjd (May 24): It is not strictly necessary, but in brt_vdev_destroy() I assert that all the counters are 0. If brt_vdev_sync() weren't called, we would have to give up those assertions, and I'd prefer not to, especially as this won't happen often. Do you agree?

@amotin (May 24): I just meant that writing something out only to delete it on the next line is a waste of time. I agree with the point about the assertions, though; it is more of an implementation detail.

On module/zfs/brt.c:

```c
if (bre1->bre_offset < bre2->bre_offset)
	return (-1);
else if (bre1->bre_offset > bre2->bre_offset)
	return (1);
```

@amotin (May 20): Most if not all AVL compare functions have been rewritten to use TREE_CMP().

On module/zfs/brt.c:

```c
/*
 * Entries to sync.
 */
avl_tree_t *bv_tree;
```

@amotin (May 20): Does it need to be a pointer? It is a very small structure; why add another pointer dereference?

On module/zfs/brt.c:

```c
if (bre1->bre_vdevid < bre2->bre_vdevid)
	return (-1);
else if (bre1->bre_vdevid > bre2->bre_vdevid)
	return (1);
```

@amotin (May 20): I see brt_entry_compare() is used only within one vdev, so I guess you could just assert that bre1->bre_vdevid == bre2->bre_vdevid.

@pjd (May 24): Good catch! This is a leftover from when all BRT entries were kept together. We can actually eliminate bre_vdevid entirely (which I just did). Thanks.
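For reference, the TREE_CMP() form amotin refers to would look roughly like this (a sketch using the offset-only comparator that remains once bre_vdevid is gone; TREE_CMP() is defined in include/sys/avl.h):

```c
static int
brt_entry_compare(const void *x1, const void *x2)
{
	const brt_entry_t *bre1 = x1;
	const brt_entry_t *bre2 = x2;

	/* TREE_CMP(a, b) evaluates to -1, 0, or 1. */
	return (TREE_CMP(bre1->bre_offset, bre2->bre_offset));
}
```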
On module/zfs/brt.c:

```c
if (error == ENOENT) {
	BRTSTAT_BUMP(brt_decref_entry_not_on_disk);
	brt_entry_free(bre);
```

@amotin (May 20): Allocating and immediately freeing memory for every freed block potentially present in the BRT is a bit of a waste. It could probably be optimized.

@pjd (May 24): Good point. Fixed.

On module/zfs/brt.c:

```c
brt_entry_fill(&bre_search, bp);

brt_enter(brt);
```

@amotin (May 20): I haven't profiled frees recently, but I have a feeling that a global pool-wide lock here will cause contention. It would be good to have it at least per-vdev.

@pjd (May 24): For now I've changed brt_lock to an rwlock. This allows taking a read lock in brt_may_exists(), which is the hot function called on every free.

On module/zfs/brt.c:

```c
	{ "decref_no_entry",	KSTAT_DATA_UINT64 }
};

#define	BRTSTAT_BUMP(stat)	atomic_inc_64(&brt_stats.stat.value.ui64)
```

@amotin (May 20): It would be good to use wmsum_add() here, to avoid touching global variables with atomics under another (now global) lock.
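A sketch of the wmsum form amotin suggests (wmsum_t and wmsum_add() are the per-CPU counter API from include/sys/wmsum.h; the struct layout here is illustrative, not the PR's):

```c
/* Per-CPU sums; aggregated only when the kstat is actually read. */
typedef struct brt_sums {
	wmsum_t	brt_decref_entry_not_on_disk;
	wmsum_t	brt_decref_no_entry;
} brt_sums_t;

static brt_sums_t brt_sums;

#define	BRTSTAT_BUMP(stat)	wmsum_add(&brt_sums.stat, 1)
```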
@allanjude (Contributor) reviewed May 24, 2022, on module/zfs/brt.c:

```c
/*
 * TODO: Walk through brtvd->bv_bitmap and write only dirty parts.
 */
dmu_write(brt->brt_mos, brtvd->bv_mos_brtvdev, 0,
```

@allanjude (May 24): Brian pointed out that this might need to be byte-swapped so it works properly across endianness.

@allanjude (May 26): Should we have a header to go with this, to store BRT_RANGE_SIZE and anything else that might change in the future? Otherwise we are likely to have compatibility issues if we ever want to change the range size.

@pjd (May 27): You won't be able to change the range size once the pool is created - or at least it will be hard, as you would need to reread the entire BRT table and recreate the refcount table. Currently I store the range size in metadata, so the define can be changed in the future without affecting existing pools.

@amotin (Member) reviewed May 26, 2022:

I've got to the end of the patch, so this is the last batch for now.

On module/zfs/dmu.c:

```c
dl->dr_override_state = DR_OVERRIDDEN;
if (BP_IS_HOLE(bp)) {
	dl->dr_overridden_by.blk_phys_birth = 0;
	dl->dr_overridden_by.blk_birth = 0;
```

@amotin (May 26): Is this correct? This case is equivalent to a hole being punched after data was written, so as I understand it, the BP should get the birth time of this TXG to replicate correctly. An exception can be made only if it was a hole and remains a hole, in which case we may keep the pointer intact.

@pjd (Nov 1, edited): Does this look better?

```diff
--- a/module/zfs/dmu.c
+++ b/module/zfs/dmu.c
@@ -2272,12 +2272,11 @@ dmu_brt_addref(objset_t *os, uint64_t object, uint64_t offset, uint64_t length,
 	dl->dr_override_state = DR_OVERRIDDEN;
 	if (BP_IS_HOLE(bp)) {
 		dl->dr_overridden_by.blk_phys_birth = 0;
-		dl->dr_overridden_by.blk_birth = 0;
 	} else {
 		dl->dr_overridden_by.blk_phys_birth =
 		    BP_PHYSICAL_BIRTH(bp);
-		dl->dr_overridden_by.blk_birth = dr->dr_txg;
 	}
+	dl->dr_overridden_by.blk_birth = dr->dr_txg;

 	/*
 	 * When data in embedded into BP there is no need to create
```

@amotin (Nov 1): Probably, but I am not completely confident in this db_dirty_records area. Additionally, if both source and destination are holes, I guess this could be skipped completely, so as not to increase metadata and replication traffic - but with the same caution.
On module/zfs/zfs_vnops.c:

```c
	    length, RL_WRITER);
	srclr = zfs_rangelock_enter(&srczp->z_rangelock, srcoffset,
	    length, RL_READER);
}
```

@amotin (May 26): Shouldn't there also be a case for srczp == dstzp, comparing srcoffset to dstoffset, to avoid a deadlock if somebody tries to run this simultaneously in opposite directions?

@pjd (May 27): Good catch!
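A sketch (not the PR's code) of the ordering amotin suggests: impose a consistent lock order so that two clones running in opposite directions over the same file cannot deadlock.

```c
/*
 * When source and destination are the same znode, acquire the two
 * range locks in offset order; otherwise keep the usual src-then-dst
 * order. Assumes srclr/dstlr are zfs_locked_range_t pointers.
 */
if (srczp == dstzp && dstoffset < srcoffset) {
	dstlr = zfs_rangelock_enter(&dstzp->z_rangelock, dstoffset,
	    length, RL_WRITER);
	srclr = zfs_rangelock_enter(&srczp->z_rangelock, srcoffset,
	    length, RL_READER);
} else {
	srclr = zfs_rangelock_enter(&srczp->z_rangelock, srcoffset,
	    length, RL_READER);
	dstlr = zfs_rangelock_enter(&dstzp->z_rangelock, dstoffset,
	    length, RL_WRITER);
}
```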
On module/zfs/zfs_vnops.c:

```c
	    length, RL_READER);
}

srcblksz = srczp->z_blksz;
```

@amotin (May 26): I feel there must be a check somewhere here that the destination block size equals the source's, or that the destination file consists of only one block, or that it will be overwritten completely. zfs_grow_blocksize(), used below, will silently do nothing otherwise, which will probably result in data corruption.

On module/zfs/zio.c:

```c
	return (zio);
}

if (BP_GET_TYPE(bp) != DMU_OT_PLAIN_FILE_CONTENTS &&
    BP_GET_TYPE(bp) != DMU_OT_ZVOL) {
```

@amotin (May 26): Should this be BP_IS_METADATA() instead, for symmetry with dmu_brt_readbps()?

On module/zfs/zfs_log.c:

```c
lr->lr_length = len;
lr->lr_blksz = blksz;
lr->lr_nbps = partnbps;
memcpy(lr->lr_bps, bps, sizeof (bps[0]) * partnbps);
```

@amotin (May 26): Do we have any guarantees that blocks from a different dataset won't be freed before the replay? For example, in the case of an encrypted dataset (I just don't remember whether non-encrypted datasets are replayed strictly on import or possibly later).

@allanjude (May 26): IIRC, they are replayed at mount, so it could be late.

@pjd (Nov 1): For now I changed the code to only use the ZIL when we clone within a single dataset. When sync=always is set on the destination dataset and we are cloning between separate datasets, we call txg_wait_synced() before returning.

@amotin (Nov 1): I suspect that unless the file is really big, it may be faster to return an error and just copy it than to wait for the TXG commit.

@behlendorf (Contributor) reviewed May 27, 2022:

I've only started working my way through the PR, but I thought I'd go ahead and post at least some initial feedback. I'll keep working through it as I can. For other reviewers, I'd suggest starting with the Big Theory comment at the top of brt.c. It does a great job explaining the high-level design.

On cmd/zdb/zdb_il.c:

```c
    tab_prefix, (u_longlong_t)lr->lr_foid, (u_longlong_t)lr->lr_offset,
    (u_longlong_t)lr->lr_length, (u_longlong_t)lr->lr_blksz);
for (i = 0; i < lr->lr_nbps; i++) {
```

@behlendorf (May 26): Suggested change:

```diff
-	for (i = 0; i < lr->lr_nbps; i++) {
+	for (unsigned int i = 0; i < lr->lr_nbps; i++) {
```

On module/icp/include/sys/bitmap.h:

@behlendorf (May 26): Is there a reason this needs to be added to the icp include directory? It'd be a bit more convenient to place it under the top-level include/sys/ path. It would also be nice to either prune the header down to what's actually used, or add the missing bitmap.c source so that all the publicly defined functions exist.

@behlendorf (Feb 24, 2023): This still should be moved. It builds fine on Linux under include/sys/.
On module/zfs/brt.c:

```c
 * algorithm or even with checksumming disabled.
 *
 * As mentioned above, the BRT entries are much smaller than the DDT entries.
 * To uniquely identify a block we just need its vdevid and offset. We also
```

@behlendorf (May 26): I'd suggest using "vdev id" instead of "vdevid" throughout.

On module/zfs/brt.c:

```c
 * references? To avoid this dilemma BRT cooperates with DDT - if a given block
 * is being cloned using BRT and the BP has the D (dedup) bit set, BRT will
 * lookup DDT entry and increase the counter there. No BRT entry will be
 * created for a block that resides on a dataset with deduplication turned on.
```

@behlendorf (May 26): I think you mean no BRT entry will be created for a block that is already in the dedup table, right? Not that this is based on the current setting of the dedup property. It seems to me this also requires that once a block has been added to the BRT, it can't be added to the DDT. Or, put another way, the contents of the BRT and the DDT are required to be mutually exclusive.

@pjd (May 27): I'll try to clarify that. You are right that we may have a block that resides on a dedup-enabled dataset and doesn't have the D bit set (i.e. it was created before dedup was turned on). Also note that it is impossible to set the D bit after block creation, so "once a block has been added to the BRT it can't be added to the DDT" cannot happen: there is no way to add a block to the DDT after it has been created.

On module/zfs/brt.c:

```c
 *	ssize_t fclonefile(int srcfd, int dstfd);
 *	ssize_t fclonerange(int srcfd, off_t srcoffset, size_t length,
 *	    int dstfd, off_t dstoffset);
 *
```
Copy link Contributor @behlendorf behlendorf May 26, 2022 There was a problem hiding this comment. Choose a reason for hiding this comment The reason will be displayed to describe this comment to others. Learn more. [Choose a reason] Hide comment For this case, and the one above I think we're going to have to handle waiting on the txg sync and retrying somewhere in the ZFS code. Based on established errno's from the man pages any existing callers aren't going to know what to do with EAGAIN. Sorry, something went wrong. 2 pjd and edillmann reacted with thumbs up emoji All reactions * 2 reactions @rincebrain rincebrain mentioned this pull request May 28, 2022 Fast Copy Between Related Datasets #13516 Closed @Haravikk Copy link Haravikk commented May 28, 2022 Excellent to see this feature being worked on! I just posted a request for "fast-copying" between datasets (#13516) because I had assumed any reflink style support wouldn't initially handle it, but it's good to see it mentioned here. In that case my issue can probably be closed. To be clear, in my issue I'm assuming automatic "fast-copying" (cloning) rather than having to explicitly call cp --reflink or similar, because a lot of file copying/moving is done by programs that won't know to do this, so it's better for it to be automatic. For move/copy between datasets though this will obviously depend upon zfs knowing when that's the case (I'm not super familiar with the system calls themselves so don't know how easy that would be to detect?), but clearly it would be better if ZFS recognises as many cases where cloning is beneficial as possible, without having to be told. One thing I'd like to discuss from my issue is copying between datasets where certain settings differ, with the main ones being copies, compression and recordsize. If a target dataset has a copies setting that differs from the source then ideally copies should be created or ignored to match. So if the source has copies=2 and the target is copies=1 then the second copy won't be cloned, while if we flip that around (copies=1 -> copies=2) then we still need to create a new copy in addition to cloning the first, so that we don't end up with "new" data that is less redundant than an ordinary copy would have been. I didn't see any mention of copies being discussed, and can't tell if this is already implemented or not. Meanwhile compression and recordsize represent settings that will mean that a cloned file will not be an exact match for the file as it would have been had it been copied normally, as the source dataset may not have compression but the target does, target may use a larger recordsize and so-on. To cover these I propose a new setting, probably better named blockcloneto for this feature, which will control cloning between datasets. I proposed three basic settings: * on: block-cloning is always used where possible when this dataset is the target. * off: block-cloning is never used (files are copied normally in all cases). * exact: block-cloning is used when compression and recordsize for the source dataset matches the target. There may be other settings that need to be considered, but these are the ones I use that seem applicable to cloning. The aim is to allow users to ensure that cloning is not used to produce files in a dataset that are less redundant, less compressed etc. than files that are copied normally. All reactions Sorry, something went wrong. 
@rincebrain (Contributor) commented May 29, 2022:

So, I tried a first pass at wiring up the Linux copy/clone range calls, and found a problem. You see, Linux has a hardcoded check in the VFS layer for each of these calls:

```c
if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
	return -EXDEV;
```

That is, it explicitly forbids cross-filesystem reflinks in a VFS check before the request ever gets to our code. The options I see for dealing with this are all pretty gross:

* Lie about having the same i_sb for all datasets on a pool, and work around it anywhere we actually consume this value (eww, and it seems like it could have weird side effects).
* Write our own ioctl reimplementing the equivalent functionality of copy_file_range without this restriction (ewwww - cp et al. wouldn't use it, so we'd need a zfs_cp or similar, though we might eventually be able to convince coreutils to support it directly... maybe...).
* I don't really have any other suggestions.

Thoughts?

@Haravikk commented May 29, 2022:

> [quoting the options above]

If the aim is to clone as many copy/move operations as possible, should we be worrying about explicit cloning commands in the first place? I think the ideal scenario is to avoid the need for those calls entirely by detecting when a file is being copied between datasets in the same pool (including within the same dataset) so that we can use cloning instead. That way programs don't need to make any explicit cloning calls - many don't and won't - so picking this up automatically on the ZFS side would be best, as it would mean more files get "deduplicated" with no extra work by users or programs.

@rincebrain (Contributor) commented May 29, 2022:

> [quoting @Haravikk's reply above]
Changing the scope of this from "userland-generated reflinking of ranges" to "reimplementing dedup inline" would significantly impact how useful it is, IMO. In particular, it would go from "I'd use it" to "I'd force it off".

The goal here is not (I believe) "maximize dedup on everything"; the goal is "allow userland to specify when something should be a CoW copy of something else", because that's where most of the benefit is compared to trying to dedup everything.

@Haravikk Haravikk commented May 29, 2022 (edited)

> Changing the scope of this from "userland generated reflinking of ranges" to "reimplementing dedup inline" would significantly impact how useful it is, IMO. In particular, it would go from "I'd use it" to "I'd force it off".

This seems like a huge overreaction; ZFS already provides more than enough tools to control redundancy, so there's simply no advantage to maintaining multiple copies of the same file beyond that (so long as the target dataset's settings are respected where necessary, see above).

I'm very much of the opposite position: if this feature is manual-only, then I guarantee I will almost never use it, because so many tools simply do not support cloning even on filesystems that support it. If I have to jump to the command line to clone manually outside of a tool, or discard and replace copies after the fact, that's an excellent way to guarantee I'll almost never do it. There's certainly potential for offline deduplication scripts to make this easier, e.g. scanning one or more datasets for duplicates and replacing them with clones, but why force it to be done retroactively? Otherwise the feature is limited to those with more specific needs, like cloning VMs or similar (where cloning whole datasets isn't suitable).

And if the cloning system calls aren't going to work for cross-dataset cases anyway, then automatic may be the only way to go regardless, which is why I mentioned it; at the very least it could be off by default and turned on temporarily when you want to clone to another dataset.

@rincebrain rincebrain (Contributor) commented May 29, 2022

I don't know if I'd say they're not going to work...
    $ ls -ali /turbopool/whatnow/bigfile2 /turbopool/whatnow2/bigfile3
    ls: cannot access '/turbopool/whatnow2/bigfile3': No such file or directory
    3 -rw-r--r-- 1 root root 10737418240 May 28 12:20 /turbopool/whatnow/bigfile2
    $ sudo cmd/clonefile/clonefile /turbopool/whatnow/bigfile2 /turbopool/whatnow2/bigfile3
    $ sudo cmd/zdb/zdb -dbdbdbdbdbdb turbopool/whatnow 3 > /tmp/file1
    $ ls -ali /turbopool/whatnow/bigfile2 /turbopool/whatnow2/bigfile3
    128 -rw-r--r-- 1 root root 10737418240 May 29 17:02 /turbopool/whatnow2/bigfile3
    3 -rw-r--r-- 1 root root 10737418240 May 28 12:20 /turbopool/whatnow/bigfile2
    $ sudo cmd/zdb/zdb -dbdbdbdbdbdb turbopool/whatnow2 128 > /tmp/file2
    $ grep ' 200000 L0' /tmp/file{1,2}
    /tmp/file1: 200000 L0 DVA[0]=<0:15800223c00:20000> [L0 ZFS plain file] edonr uncompressed unencrypted LE contiguous unique single size=20000L/20000P birth=377L/377P fill=1 cksum=1746b656272237d9:cd37f4f0b1f655f5:699bc3e57a9d0e06:72ebf1ea28603be2
    /tmp/file2: 200000 L0 DVA[0]=<0:15800223c00:20000> [L0 ZFS plain file] edonr uncompressed unencrypted LE contiguous unique single size=20000L/20000P birth=4158L/377P fill=1 cksum=1746b656272237d9:cd37f4f0b1f655f5:699bc3e57a9d0e06:72ebf1ea28603be2
    $ df -h /turbopool/whatnow2/bigfile3 /turbopool/whatnow/bigfile2
    Filesystem          Size  Used Avail Use% Mounted on
    turbopool/whatnow2  1.8T   21G  1.8T   2% /turbopool/whatnow2
    turbopool/whatnow   1.8T   41G  1.8T   3% /turbopool/whatnow

It's more inconvenient, but it's certainly not unworkable.

I'm not saying there's no use for anyone if this replaces the existing inline dedup implementation; further up the thread I advocate for doing just that, and for implementing similar functionality to allow inline dedup on send/recv with this. But personally, the performance trade-offs of inline dedup aren't worth it to me, especially compared to doing it as post-processing later as needed, or with explicit notification. So if that were the only functionality, I would not be making use of it.

@lukts30 lukts30 commented May 29, 2022 (edited)

> So, I tried a first pass at wiring up the Linux copy/clone range calls, and found a problem. You see, Linux has a hardcoded check in the VFS layer for each of these calls [...]

My understanding from reading the man page and kernel source code is that this is indeed a problem for all ioctl-based file clone/dedup calls, but not for the newer, more generic copy_file_range syscall. copy_file_range has explicitly supported cross-filesystem copies since Linux 5.3. If the source and destination files are both on ZFS but not on the same superblock, ZFS still gets the chance to implement "copy acceleration" via copy_file_range. See torvalds/linux@5dae222.

This is still not optimal, since most tools that claim to be reflink-compatible will try the ioctl first and only might fall back to copy_file_range. E.g. GNU coreutils cp and Syncthing would still try copy_file_range, but other tools might not. https://dev.to/albertzeyer/difference-of-ficlone-vs-ficlonerange-vs-copyfilerange-for-copy-on-write-support-41lm
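To make the fallback pattern lukts30 describes concrete, here is a minimal userspace sketch (not from this PR; it assumes Linux with glibc 2.27+ and the kernel 5.3+ cross-mount behavior described above): try copy_file_range(2) first, and drop to a plain read/write loop when the kernel or filesystem refuses.

    /*
     * Minimal sketch of the userspace pattern discussed above: attempt
     * copy_file_range(2), which since Linux 5.3 may succeed even across
     * mounts, and fall back to an ordinary read/write loop on EXDEV,
     * ENOSYS, or EINVAL. Illustrative only; error handling abbreviated.
     */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <unistd.h>

    static int
    copy_fd(int src, int dst, off_t remaining)
    {
            while (remaining > 0) {
                    ssize_t n = copy_file_range(src, NULL, dst, NULL,
                        (size_t)remaining, 0);
                    if (n > 0) {
                            remaining -= n;
                            continue;
                    }
                    if (n == 0)
                            break;  /* unexpected EOF */
                    if (errno != EXDEV && errno != ENOSYS && errno != EINVAL)
                            return (-1);

                    /* Fallback: plain buffered copy from the current offsets. */
                    char buf[65536];
                    ssize_t r;
                    while ((r = read(src, buf, sizeof (buf))) > 0) {
                            for (ssize_t off = 0; off < r; ) {
                                    ssize_t w = write(dst, buf + off, r - off);
                                    if (w < 0)
                                            return (-1);
                                    off += w;
                            }
                    }
                    return (r < 0 ? -1 : 0);
            }
            return (0);
    }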
@rincebrain rincebrain (Contributor) commented May 29, 2022 (edited)

coreutils (8.30) on Linux 5.4 was where I saw EXDEV without our copy_file_range or remap_file_range ever executing, and looking at the source, 8.30 doesn't know about copy_file_range at all. So even if the kernel provides copy_file_range, plenty of older platforms aren't going to play, unfortunately.

(Not that I'm not ecstatic to be wrong and that we can have our cake and eat it too, and also amused that Oracle originated the patch; but I think we still need a fallback, albeit a much simpler one than what I implemented, for those cases.)

e: Anyway, I'll polish up what I have for making this play nice under Linux with cp --reflink and for extending clonefile to use copy_file_range, and post it, probably tomorrow, for people to use or not as they like.

e: I actually just tried coreutils git and it also refused cp --reflink with EXDEV, so I'm guessing their autodetection handling is... incorrect, because actually calling copy_file_range works great.

@jittygitty mentioned this pull request May 29, 2022: Support for ZFS on top of VDO inline deduplication and best practices etc? #13462 (Open)

@rincebrain rincebrain (Contributor) commented Jun 18, 2022

Huh, did I really never post anything more in here? Rude of me.

It seems the consensus is that Linux should change here, and that coreutils can't make the behavior I wrote about any better without violating correctness. I put the odds of that change at infinitely approaching zero without quite getting there, so I suppose we'll be stuck with clonefile and/or documenting that cp --reflink=always doesn't behave as expected, unless there's some detail I'm overlooking.

I've been busy and haven't done what I wanted to, which was to extend the clonefile command to do smaller-than-full-range copies, write a bunch of tests to exercise that, and report back. Maybe I'll get that done in the next few days; we'll see.

@IvanVolosyuk IvanVolosyuk commented Jun 19, 2022

> * Lie about having the same i_sb for all datasets on a pool, and work around doing this anywhere we actually consume this value (eww, and seems like it could have weird side effects)

i_sb is the superblock of the filesystem. Is it actually lying, if the data can be linked freely across the pool? It might be interesting to find the btrfs discussion on this topic and see why they implemented it the way they did.

@rincebrain rincebrain (Contributor) commented Jun 19, 2022

It's probably worth reading about some side effects of how btrfs handles subvolumes before concluding that might be a good route.

@Haravikk Haravikk commented Jun 19, 2022 (edited)

So if leveraging the system clone call isn't an option, what information do we have to work with?

My thinking was that if we can know the source dataset, the target dataset, and the file/inode being copied or moved, that should be enough for ZFS to decide whether to clone, without having to be told explicitly. I don't know enough about the filesystem calls to know how easy that information is to get, though; I was thinking that if we know a file is being copied within the same dataset, or copied/moved between two datasets under the same (or no) encryption root, then ZFS could simply decide to clone automatically based on whatever rules are set on the target (e.g. always clone where possible, clone only if compression/recordsize etc. match, or never clone).
It's never really made sense to me why an explicit call should be required to clone in the first place: if you trust a filesystem enough to store your data, you shouldn't need to create extra redundancy by default, especially on ZFS with mirror or raidz vdevs, copies=2, and so on. If those aren't enough redundancy for you, you could still disable cloning (either permanently or temporarily).

It should still probably be an opt-in feature, as administrators know their pools best, but the three main options I outlined should be sufficient. If the clone system calls are improved in future, we could add a fourth setting, "on", meaning cloning is permitted but only when the appropriate system call is used. Actually, thinking about it, this option could be made available immediately: even if the system calls won't work for now, we could presumably have a ZFS-specific command to begin with (e.g. something under zdb) to handle the cloning entirely within ZFS. Not ideal, but for those who want manual cloning only, it would make that possible in the interim.

@lukts30 lukts30 commented Jun 19, 2022

> My thinking was that if we can know the source dataset, the target dataset, and the file/inode being copied or moved, that should be enough for ZFS to decide whether to clone, without having to be told explicitly. I don't know enough about the filesystem calls to know how easy that information is to get, though;

The whole reason reflink exists, and why it is an explicit operation, is that:

> To the filesystem, a cp isn't a copy -- it's one process reading from one file and writing to another. Figuring out that that is supposed to be a copy is very non-trivial and expensive, especially when taking into account metadata operations which aren't part of the regular file stream.

https://lwn.net/Articles/332076/

And IMHO, if you want things done implicitly, ZFS already has dedup. Block cloning/reflink is purposely built around being an explicit operation, so that the kernel knows the intention of userspace; there is no guessing game that some open, read, and write syscalls all correlate together. (See the sketch of the explicit interface below.)

@Haravikk mentioned this pull request Jun 19, 2022: Lightweight Deduplication (for Copying) #13572 (Open)

@problame problame (Contributor) requested changes Jun 22, 2022

I did another ZIL-focussed review. Apologies for the formatting; I did it offline and then copy-pasted into GitHub.

Review comments were left on module/zfs/zfs_log.c, module/zfs/zfs_replay.c, and module/zfs/zfs_vnops.c (most since resolved).
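For reference, the explicit interface lukts30 describes above is the FICLONE/FICLONERANGE ioctl family on Linux. A whole-file clone request from userspace looks roughly like the following sketch; the clone_file() helper is made up for illustration, but the ioctl itself is real:

    /*
     * Sketch of the explicit clone operation described above: FICLONE asks
     * the filesystem to make dst a CoW clone of src in a single call, so
     * the kernel is told the intent instead of guessing that a stream of
     * read(2)/write(2) calls "is a copy". Illustrative only.
     */
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>   /* FICLONE */

    static int
    clone_file(const char *src_path, const char *dst_path)
    {
            int ret = -1;
            int src = open(src_path, O_RDONLY);
            int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

            if (src >= 0 && dst >= 0)
                    ret = ioctl(dst, FICLONE, src); /* EXDEV if i_sb differs */
            if (src >= 0)
                    (void) close(src);
            if (dst >= 0)
                    (void) close(dst);
            return (ret);
    }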
@pjd pjd force-pushed the brt branch from 7e453cc to 7079d3d, March 8, 2023 23:06

@pjd: Implementation of block cloning for ZFS (32b97d2)

Block Cloning allows a file (or a subset of its blocks) to be manually cloned into another (or the same) file by just creating additional references to the data blocks, without copying the data itself. Those references are kept in the Block Reference Tables (BRTs). The whole design of block cloning is documented in module/zfs/brt.c.

Signed-off-by: Pawel Jakub Dawidek

@pjd pjd force-pushed the brt branch from 1f34686 to 32b97d2, March 9, 2023 07:01

@behlendorf behlendorf (Contributor) approved these changes Mar 10, 2023

Thanks for working through the feedback! This looks ready to me. We have a few changes to follow up with, but I don't think they need to hold the rest of this up:

1. We'll want to further address the ZIL replay issue mentioned in the feedback and commented on above zvol_replay_clone_range. That can be handled in its own PR.
2. This is wired up for FreeBSD, but not yet for Linux. Again, there are some platform-specific details there which can be worked out in a follow-up PR.

@behlendorf added the label "Status: Accepted (ready to integrate: reviewed, tested)" and removed "Status: Code Review Needed" Mar 10, 2023

@behlendorf merged commit 67a1b03 into openzfs:master Mar 10, 2023 (14 of 16 checks passed)

@adamdmoss adamdmoss (Contributor) commented Mar 10, 2023 (edited)

Apologies, since I lost track: did this land in a state compatible with existing Linux copy_file_range/reflink support (i.e. cp --reflink), or is a separate flag/syscall/API needed to use it? Thanks!

@amotin amotin (Member) commented Mar 10, 2023

@adamdmoss It landed without Linux support yet. Only FreeBSD for now, but hopefully not for long.

@adamdmoss adamdmoss (Contributor) commented Mar 10, 2023

Cool, thanks for the clarification.

@cyphar cyphar (Contributor) commented Mar 12, 2023

Is anyone else already working on wiring this up on Linux? I'd be willing to give it a burl if no one else is working on it.

@ryao ryao (Contributor) commented Mar 13, 2023

@cyphar Currently, @rincebrain is doing some experiments with it.

@amotin amotin (Member) commented Mar 13, 2023

We also have a guy looking into it.
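Until the Linux plumbing lands, one rough way to confirm on a FreeBSD pool that clones are actually being created is via the block-cloning accounting added alongside the BRT. This is a sketch under assumptions: "tank" is a placeholder pool name, and the bcloneused/bclonesaved/bcloneratio pool properties are assumed to be present in the build in use:

    $ zpool get feature@block_cloning tank    # expect "enabled", or "active" once clones exist
    $ zpool list -o name,bcloneused,bclonesaved,bcloneratio tank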
mcmilk pushed a commit to mcmilk/zfs that referenced this pull request Mar 13, 2023: Implementation of block cloning for ZFS (913e1bb)

lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 16, 2023: Implementation of block cloning for ZFS (62f5d52)

@jittygitty mentioned this pull request Mar 28, 2023: FAST-Tracking REFLINK and Offline Deduplication, first for LINUX only #13349 (Open)

@Majiir mentioned this pull request Apr 4, 2023: Feature Request - online split clone #2105 (Open)

@mjguzik mjguzik (Contributor) commented Apr 11, 2023

Given the above commentary, I take it this was not tested on Linux? We got reports of data and pool corruption on FreeBSD after the import which landed the feature, for example:

> Initially I copied my /usr/obj from my two build machines (one amd64.amd64 and one i386.i386) to my "sandbox" zpool. Next, with block_cloning disabled, I did a cp -R of the /usr/obj test files, then a diff -qr. The source and target directories were the same. Next, I cleaned up (rm -rf) the target directory to prepare for the block_cloning-enabled test.
>
> Next, I did zpool checkpoint t. After this, zpool upgrade t. Pool t now has block_cloning enabled. I repeated the cp -R test from above, followed by a diff -qr. Almost every file was different. The pool was corrupted. I restored the pool, removing the corruption, with:
>
>     slippy# zpool export t
>     slippy# zpool import --rewind-to-checkpoint t
>
> It is recommended that people avoid upgrading their zpools until the problem is fixed.

There is also a report claiming corruption can show up even with the feature disabled; I don't know yet whether it is related.

@ryao ryao (Contributor) commented Apr 11, 2023

> Given the above commentary, I take it this was not tested on Linux?

It was not. There are also no test cases in ZTS for this. We really should get those implemented sooner rather than later. A few bugs have already been found in this from a mix of static analysis and people attempting to adapt it for use on Linux.
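Until fixes and tests land, the checkpoint-based recovery quoted above generalizes into a cautious upgrade recipe. This is a sketch, with "tank" as a placeholder pool name; all commands shown are standard zpool subcommands:

    $ zpool checkpoint tank                      # take a rewind point before upgrading
    $ zpool upgrade tank                         # enables new feature flags, incl. block_cloning
      ... exercise the pool (cp -R, diff -qr, zpool scrub) ...
    $ zpool checkpoint -d tank                   # all good: discard the checkpoint
    $ zpool export tank                          # or, on corruption: rewind instead
    $ zpool import --rewind-to-checkpoint tank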
@mjguzik mjguzik (Contributor) commented Apr 11, 2023

Given the above, I don't think the feature is ready for prime time, and consequently it should be reverted for the time being.

@ryao ryao (Contributor) commented Apr 11, 2023

The bugs reported so far have already been fixed: ce0e1cc, 5b5f518. That said, we still need test cases in ZTS for this.

@mjguzik mjguzik (Contributor) commented Apr 11, 2023

We have these commits; they came with the merge. As noted above, issues are still there.

@thesamesam thesamesam commented Apr 11, 2023

Let's start with a bug report in a separate issue.

@pjd pjd (Contributor, Author) commented Apr 11, 2023

Let me summarize what I found, as a lot of FUD is being spread.

The corruptions reported when block cloning was disabled were, AFAIU, due to EXDEV leaking to userland's cp(1) and not being handled in the kernel. This resulted in cp(1) creating empty files; it was not data corruption inside ZFS. It happened because OpenZFS was merged into FreeBSD before the plumbing was finished: https://reviews.freebsd.org/D38803. @mjguzik should know that, as he was involved in the review.

I can confirm there is data corruption when block cloning is enabled. It happens for very small files, where embedded blocks are used: I was overwriting blk_phys_birth, which for embedded blocks is part of the payload. The corrupted file has 8 bytes overwritten at position 60. zpool status -v may report checksum errors in those files. After removing the problematic files and running a scrub, everything comes back to normal. I've created a pull request with a fix: #14739.

I'm also working on adding tests, but I'd like to have more complete coverage, and currently there is no userland tool apart from cp(1) that uses copy_file_range(2). Maybe integrating pjdfstest would be beneficial? I could extend its fstest tool to allow testing copy_file_range(2) with different offsets and sizes.

I hope that helps.

PS. I'm currently in the middle of spring break with my family, mostly traveling, so I may answer with delays. If someone wants to commit the fix to FreeBSD directly, be my guest.

behlendorf pushed a commit that referenced this pull request Apr 12, 2023: Fix data corruption when cloning embedded blocks (c71fe71)

Don't overwrite blk_phys_birth, as for embedded blocks it is part of the payload.

Reviewed-by: Richard Yao
Reviewed-by: Brian Behlendorf
Signed-off-by: Pawel Jakub Dawidek
Issue #13392
Closes #14739
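As a rough illustration of the hazard pjd describes (a sketch, not the actual c71fe71 change; brt_clone_bp() is a made-up helper name): in an embedded block pointer the data payload is stored inside the blkptr_t itself, overlapping the bytes that normally hold blk_phys_birth, so the clone path may only update the logical birth time there:

    /*
     * Sketch of the embedded-block hazard described above; not the actual
     * c71fe71 patch, and brt_clone_bp() is hypothetical. For embedded BPs
     * the payload overlaps the physical-birth bytes, so rewriting the
     * physical birth TXG scribbles over file contents.
     */
    #include <sys/spa.h>

    static void
    brt_clone_bp(blkptr_t *dbp, const blkptr_t *sbp, uint64_t txg)
    {
            *dbp = *sbp;    /* a clone is just another reference */
            if (BP_IS_EMBEDDED(sbp)) {
                    /* Payload lives in the BP: only logical birth is safe. */
                    dbp->blk_birth = txg;
            } else {
                    BP_SET_BIRTH(dbp, txg, BP_PHYSICAL_BIRTH(sbp));
            }
    }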
@edillmann edillmann (Contributor) commented Apr 24, 2023

Hi @pjd, any news on the porting of this to Linux (client side)? Thanks.

andrewc12 pushed commits to andrewc12/openzfs that referenced this pull request Apr 30, 2023: Fix data corruption when cloning embedded blocks (3d61173, 56405ae)

behlendorf added a commit that referenced this pull request Jun 30, 2023: Tag 2.2.0-rc1 (009d328)

New features:
- Fully adaptive ARC eviction (#14359)
- Block cloning (#13392)
- Scrub error log (#12812, #12355)
- Linux container support (#14070, #14097, #12263)
- BLAKE3 checksums (#12918)
- Corrective "zfs receive" (#9372)

Signed-off-by: Brian Behlendorf

@darkdragon-001 mentioned this pull request Jul 1, 2023: COW cp (--reflink) support #405 (Open)

Reviewers: allanjude (left review comments), amotin (approved), adamdmoss (left review comments), problame (requested changes), behlendorf (approved).