From ilepore@damnhippie.dyndns.org  Wed Mar  2 22:06:34 2011
Return-Path: <ilepore@damnhippie.dyndns.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 980BB106566C
	for <freebsd-gnats-submit@freebsd.org>; Wed,  2 Mar 2011 22:06:34 +0000 (UTC)
	(envelope-from ilepore@damnhippie.dyndns.org)
Received: from qmta03.emeryville.ca.mail.comcast.net (qmta03.emeryville.ca.mail.comcast.net [76.96.30.32])
	by mx1.freebsd.org (Postfix) with ESMTP id 782CB8FC14
	for <freebsd-gnats-submit@freebsd.org>; Wed,  2 Mar 2011 22:06:34 +0000 (UTC)
Received: from omta19.emeryville.ca.mail.comcast.net ([76.96.30.76])
	by qmta03.emeryville.ca.mail.comcast.net with comcast
	id EMqY1g0011eYJf8A3MtQ93; Wed, 02 Mar 2011 21:53:24 +0000
Received: from damnhippie.dyndns.org ([24.8.232.202])
	by omta19.emeryville.ca.mail.comcast.net with comcast
	id EMtL1g00R4NgCEG01MtMzK; Wed, 02 Mar 2011 21:53:22 +0000
Received: from revolution.hippie.lan (revolution.hippie.lan [172.22.42.240])
	by damnhippie.dyndns.org (8.14.3/8.14.3) with ESMTP id p22LrILY092795
	for <FreeBSD-gnats-submit@freebsd.org>; Wed, 2 Mar 2011 14:53:18 -0700 (MST)
	(envelope-from ilepore@damnhippie.dyndns.org)
Received: (from ilepore@localhost)
	by revolution.hippie.lan (8.14.4/8.14.4/Submit) id p22LrIhP097307;
	Wed, 2 Mar 2011 14:53:18 -0700 (MST)
	(envelope-from ilepore)
Message-Id: <201103022153.p22LrIhP097307@revolution.hippie.lan>
Date: Wed, 2 Mar 2011 14:53:18 -0700 (MST)
From: Ian Lepore <freebsd@damnhippie.dyndns.org>
Reply-To: Ian Lepore <freebsd@damnhippie.dyndns.org>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: [patch] MMC/SD IO slow on Atmel ARM with modern large SD cards
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         155214
>Category:       arm
>Synopsis:       [patch] MMC/SD IO slow on Atmel ARM with modern large SD cards
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-arm
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Mar 02 22:10:10 UTC 2011
>Closed-Date:    Mon Feb 24 07:40:32 MST 2014
>Last-Modified:  Mon Feb 24 07:40:32 MST 2014
>Originator:     Ian Lepore <freebsd@damnhippie.dyndns.org>
>Release:        FreeBSD 8.2-RC3 arm
>Organization:
none
>Environment:
FreeBSD dvb 8.2-RC3 FreeBSD 8.2-RC3 #49: Tue Feb 15 22:52:14 UTC 2011     root@revolution.hippie.lan:/usr/obj/arm/usr/src/sys/DVB  arm

The included patch is against -current even though the problem was first seen
on 8.2-RC3.

The problem was seen on AT91RM9200 hardware, but presumably also affects the
SAM9 series which uses the same driver code.

>Description:
With the latest generation of large-capacity SD cards, write speeds as low as
20 kbytes/sec are seen.  These modern cards have erase-block sizes as large as 
8192K (compared to 32K typical on previous generations).  The at91_mci driver
does only single-sector IO; apparently this forces the SD card to perform an
expensive internal read-erase-modify-write cycle for each 512-byte block
written to the card.
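
To put the erase-block sizes above in perspective, here is a minimal sketch of
the arithmetic in C.  The function names are made up for illustration, and the
"whole erase block rewritten per sector" model is only an assumed worst case;
what a given card's controller really does internally is not known here.

```c
#include <stdint.h>

#define SECTOR_BYTES 512u

/* Number of 512-byte sectors that share one erase block. */
static uint32_t
sectors_per_erase_block(uint32_t erase_block_bytes)
{
	return (erase_block_bytes / SECTOR_BYTES);
}

/*
 * Hypothetical upper bound: bytes the card would move internally if every
 * single-sector write caused one full erase block to be rewritten.  Real
 * cards may do far better; this is an assumption, not a measurement.
 */
static uint64_t
worst_case_rewritten_bytes(uint32_t nsectors, uint32_t erase_block_bytes)
{
	return ((uint64_t)nsectors * erase_block_bytes);
}
```

With an 8192K erase block, sectors_per_erase_block() gives 16384 sectors per
erase block; with a 32K erase block it gives 64.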

Even with older SD cards that have smaller erase-block sizes, write throughput
to the card rarely exceeds about 350 kbytes/sec, which is still very slow for
modern hardware.

With the patch provided, write throughput improves to approximately
1500 kbytes/sec.  Read speeds also improve (from 640 to 2400 kbytes/sec).

>How-To-Repeat:
The slow IO can be demonstrated using dd to write directly to the device
(bypassing buffering and caching at higher layers):

dvb# dd if=/dev/zero of=/dev/mmcsd0s2a bs=64k count=100
100+0 records in
100+0 records out
6553600 bytes transferred in 266.252838 secs (24614 bytes/sec)

dvb# dd if=/dev/zero of=/dev/mmcsd0s2a bs=6400k count=1
1+0 records in
1+0 records out
6553600 bytes transferred in 265.004883 secs (24730 bytes/sec)
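
The rates in the two dd runs above can be cross-checked with a few lines of C.
This is only a sketch; the helper names are invented for illustration, and the
per-sector figure simply divides the measured rate by the 512-byte sector size
without modelling anything inside the card or driver.

```c
#include <stdint.h>

/* Bytes per second, rounded to the nearest whole number. */
static uint64_t
bytes_per_sec(uint64_t bytes, double secs)
{
	return ((uint64_t)((double)bytes / secs + 0.5));
}

/* Average time in milliseconds spent per 512-byte sector at a given rate. */
static double
ms_per_sector(uint64_t rate_bytes_per_sec)
{
	return (512.0 * 1000.0 / (double)rate_bytes_per_sec);
}
```

Feeding in 6553600 bytes with 266.252838 and 265.004883 seconds reproduces the
24614 and 24730 bytes/sec figures, which works out to roughly 20.8 ms per
sector in both cases.  The dd block size (64k versus 6400k) makes no visible
difference, which is consistent with the driver issuing single-sector IO
regardless of the request size.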

>Fix:
The following patch adds support for multi-block IO to the at91_mci driver.

The patch is to -current even though the problem was discovered in 8.2-RC3.
The patched driver will also work in 8.2, but requires a few other items in
the arm/at91 directory to be back-ported as well (to support references to
things such as at91_master_clock and at91_is_rm9200).
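
As a rough illustration of what multi-block IO buys, the sketch below counts
how many data-transfer commands are needed for a given amount of data when
each command may carry at most max_blocks sectors.  The patch raises the value
reported for MMCBR_IVAR_MAX_DATA from 1 to AT91_MCI_MAX_BLOCKS (64 by
default); the helper name is hypothetical, and the model assumes the upper
layers split requests purely on that limit, ignoring any other limits they
may impose.

```c
#include <stdint.h>

#define SECTOR_BYTES 512u

/*
 * Commands needed to move nbytes when each command carries at most
 * max_blocks sectors.  Both divisions round up, so a partial final sector
 * or a partial final chunk still counts.  max_blocks must be non-zero.
 */
static uint32_t
commands_needed(uint32_t nbytes, uint32_t max_blocks)
{
	uint32_t sectors = (nbytes + SECTOR_BYTES - 1) / SECTOR_BYTES;

	return ((sectors + max_blocks - 1) / max_blocks);
}
```

For the 6553600-byte dd test this gives 12800 commands at one block per
command and 200 commands at 64 blocks per command.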

--- patch-arm-at91_mci-for-multiblock begins here ---
--- sys/arm/at91/at91_mci.c.cvs_v1.18	2011-02-28 13:43:54.000000000 -0700
+++ sys/arm/at91/at91_mci.c	2011-03-02 07:21:38.000000000 -0700
@@ -67,7 +67,13 @@
 
 #include "opt_at91.h"
 
-#define BBSZ	512
+#ifndef AT91_MCI_MAX_BLOCKS
+#define AT91_MCI_MAX_BLOCKS 64		/* default 32k bounce buffer */
+#endif
+
+#define BBSIZE (AT91_MCI_MAX_BLOCKS * 512)
+
+static int mci_debug;
 
 struct at91_mci_softc {
 	void *intrhand;			/* Interrupt handle */
@@ -75,21 +81,24 @@
 	int sc_cap;
 #define	CAP_HAS_4WIRE		1	/* Has 4 wire bus */
 #define	CAP_NEEDS_BYTESWAP	2	/* broken hardware needing bounce */
-	int flags;
 	int has_4wire;
-#define CMD_STARTED	1
-#define STOP_STARTED	2
+	int flags;
+#define PENDING_CMD	0x01
+#define PENDING_STOP	0x02
+#define PENDING_ERROR	0x04
+#define CMD_MULTIREAD	0x10
+#define CMD_MULTIWRITE	0x20
 	struct resource *irq_res;	/* IRQ resource */
 	struct resource	*mem_res;	/* Memory resource */
 	struct mtx sc_mtx;
 	bus_dma_tag_t dmatag;
 	bus_dmamap_t map;
-	int mapped;
 	struct mmc_host host;
 	int bus_busy;
 	struct mmc_request *req;
 	struct mmc_command *curcmd;
-	char bounce_buffer[BBSZ];
+	char      *bbuf_vaddr;		/* bounce buf in KVA space */
+	bus_addr_t bbuf_paddr;		/* bounce buf mapped into DMA space */
 };
 
 static inline uint32_t
@@ -124,6 +133,14 @@
 #define AT91_MCI_ASSERT_UNLOCKED(_sc) mtx_assert(&_sc->sc_mtx, MA_NOTOWNED);
 
 static void
+at91_mci_getaddr(void *arg, bus_dma_segment_t *segs, int nsegs, int error)
+{
+	if (error != 0)
+		return;
+	*(bus_addr_t *)arg = segs[0].ds_addr;
+}
+
+static void
 at91_mci_pdc_disable(struct at91_mci_softc *sc)
 {
 	WR4(sc, PDC_PTCR, PDC_PTCR_TXTDIS | PDC_PTCR_RXTDIS);
@@ -137,6 +154,48 @@
 	WR4(sc, PDC_TNCR, 0);
 }
 
+/* Reset the controller, then restore most of the current state.
+ *
+ * This is called after detecting an error.  It's also called after stopping a
+ * multi-block write, to un-wedge the device so that it will handle the NOTBUSY
+ * signal correctly.  See comments in at91_mci_stop_done() for more details.
+ */
+static void at91_mci_reset(struct at91_mci_softc *sc)
+{
+	uint32_t mr;
+	uint32_t sdcr;
+	uint32_t dtor;
+	uint32_t imr;
+
+	at91_mci_pdc_disable(sc);
+
+	/* save current state */
+
+	imr  = RD4(sc, MCI_IMR);
+	mr   = RD4(sc, MCI_MR) & 0x7fff;
+	sdcr = RD4(sc, MCI_SDCR);
+	dtor = RD4(sc, MCI_DTOR);
+
+	/* reset the controller */
+
+	WR4(sc, MCI_IDR, 0xffffffff);
+	WR4(sc, MCI_CR, MCI_CR_MCIDIS | MCI_CR_SWRST);
+
+	/* restore state */
+
+	WR4(sc, MCI_CR, MCI_CR_MCIEN);
+	WR4(sc, MCI_MR, mr);
+	WR4(sc, MCI_SDCR, sdcr);
+	WR4(sc, MCI_DTOR, dtor);
+	WR4(sc, MCI_IER, imr);
+
+	/* Make sure sdio interrupts will fire.  Not sure why reading
+	 * SR ensures that, but this is in the linux driver.
+	 */
+
+	RD4(sc, MCI_SR);
+}
+
 static void
 at91_mci_init(device_t dev)
 {
@@ -145,7 +204,7 @@
 	WR4(sc, MCI_CR, MCI_CR_MCIEN);		/* Enable controller */
 	WR4(sc, MCI_IDR, 0xffffffff);		/* Turn off interrupts */
 	WR4(sc, MCI_DTOR, MCI_DTOR_DTOMUL_1M | 1);
-	WR4(sc, MCI_MR, 0x834a);	// XXX GROSS HACK FROM LINUX
+	WR4(sc, MCI_MR, 0x834a); // set PDCMODE, PWSDIV=3, CLKDIV=74
 #ifndef  AT91_MCI_SLOT_B
 	WR4(sc, MCI_SDCR, 0);			/* SLOT A, 1 bit bus */
 #else
@@ -168,7 +227,6 @@
 static int
 at91_mci_probe(device_t dev)
 {
-
 	device_set_desc(dev, "MCI mmc/sd host bridge");
 	return (0);
 }
@@ -193,24 +251,44 @@
 
 	AT91_MCI_LOCK_INIT(sc);
 
-	/*
-	 * Allocate DMA tags and maps
+	at91_mci_fini(dev);
+	at91_mci_init(dev);
+
+	/* Allocate DMA tags and maps and a physically-contiguous bounce buffer.
+	 *
+	 * The parms in the tag_create call cause the dmamem_alloc call to
+	 * create a single contiguous buffer of BBSIZE bytes aligned to a 4096
+	 * byte boundary.
+	 *
+	 * Allocate the bounce buffer using DMA_COHERENT which in effect means
+	 * that the pages are mapped as non-cacheable (making all the
+	 * bus_dmamap_sync() calls become fast no-ops) because there's no value
+	 * in caching the data in the swap/bounce buffers (there would just be
+	 * extra overhead in flushing the caches after the data has been
+	 * accessed exactly once).
+	 *
+	 * After allocating the bounce buffer we load the map for it and leave
+	 * it loaded as long as the driver is active.
 	 */
-	err = bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0,
-	    BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR, NULL, NULL, MAXPHYS, 1,
-	    MAXPHYS, BUS_DMA_ALLOCNOW, NULL, NULL, &sc->dmatag);
+
+	err = bus_dma_tag_create(bus_get_dma_tag(dev), 4096, 0,
+	    BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR, NULL, NULL, 
+	    BBSIZE, 1, MAXPHYS, 0, NULL, NULL, &sc->dmatag);
 	if (err != 0)
 		goto out;
 
-	err = bus_dmamap_create(sc->dmatag, 0,  &sc->map);
+	err = bus_dmamem_alloc(sc->dmatag, (void **)&sc->bbuf_vaddr,
+	    BUS_DMA_NOWAIT|BUS_DMA_COHERENT, &sc->map);
 	if (err != 0)
 		goto out;
 
-	at91_mci_fini(dev);
-	at91_mci_init(dev);
+	 err = bus_dmamap_load(sc->dmatag, sc->map, sc->bbuf_vaddr, BBSIZE,
+	    at91_mci_getaddr, &sc->bbuf_paddr, BUS_DMA_NOWAIT);
+	 if (err != 0)
+		 goto out;
 
 	/*
-	 * Activate the interrupt
+	 * Set up to handle interrupts.
 	 */
 	err = bus_setup_intr(dev, sc->irq_res, INTR_TYPE_MISC | INTR_MPSAFE,
 	    at91_mci_intr, sc, &sc->intrhand);
@@ -230,10 +308,13 @@
 	if (sc->has_4wire)
 		sc->sc_cap |= CAP_HAS_4WIRE;
 
-	sc->host.f_min = at91_master_clock / 512;
+	/* Our real min freq is master_clock/512, but upper driver layers are
+	 * going to set the min speed during card discovery, and the right speed
+	 * for that is 400khz, so advertise a safe value just under that.
+	 */
 	sc->host.f_min = 375000;
 	sc->host.f_max = at91_master_clock / 2;
-	if (sc->host.f_max > 50000000)	
+	if (sc->host.f_max > 50000000)
 		sc->host.f_max = 50000000;	/* Limit to 50MHz */
 
 	sc->host.host_ocr = MMC_OCR_320_330 | MMC_OCR_330_340;
@@ -252,8 +333,15 @@
 static int
 at91_mci_detach(device_t dev)
 {
+	struct at91_mci_softc *sc = device_get_softc(dev);
+
 	at91_mci_fini(dev);
 	at91_mci_deactivate(dev);
+
+	bus_dmamap_unload(sc->dmatag, sc->map);
+	bus_dmamem_free(sc->dmatag, sc->bbuf_vaddr, sc->map);
+	bus_dma_tag_destroy(sc->dmatag);
+
 	return (EBUSY);	/* XXX */
 }
 
@@ -293,7 +381,7 @@
 	sc->intrhand = 0;
 	bus_generic_detach(sc->dev);
 	if (sc->mem_res)
-		bus_release_resource(dev, SYS_RES_IOPORT,
+		bus_release_resource(dev, SYS_RES_MEMORY,
 		    rman_get_rid(sc->mem_res), sc->mem_res);
 	sc->mem_res = 0;
 	if (sc->irq_res)
@@ -303,14 +391,6 @@
 	return;
 }
 
-static void
-at91_mci_getaddr(void *arg, bus_dma_segment_t *segs, int nsegs, int error)
-{
-	if (error != 0)
-		return;
-	*(bus_addr_t *)arg = segs[0].ds_addr;
-}
-
 static int
 at91_mci_update_ios(device_t brdev, device_t reqdev)
 {
@@ -322,7 +402,7 @@
 	sc = device_get_softc(brdev);
 	host = &sc->host;
 	ios = &host->ios;
-	// bus mode?
+
 	if (ios->clock == 0) {
 		WR4(sc, MCI_CR, MCI_CR_MCIDIS);
 		clkdiv = 0;
@@ -346,137 +426,176 @@
 static void
 at91_mci_start_cmd(struct at91_mci_softc *sc, struct mmc_command *cmd)
 {
-	uint32_t cmdr, ier = 0, mr;
+	uint32_t cmdr, mr;
 	uint32_t *src, *dst;
 	int i;
 	struct mmc_data *data;
-	void *vaddr;
-	bus_addr_t paddr;
 
 	sc->curcmd = cmd;
 	data = cmd->data;
-	cmdr = cmd->opcode;
 
 	/* XXX Upper layers don't always set this */
 	cmd->mrq = sc->req;
 
+	/* Begin setting up command register. */
+
+	cmdr = cmd->opcode;
+
+	if (sc->host.ios.bus_mode == opendrain)
+		cmdr |= MCI_CMDR_OPDCMD;
+
+	/* Set up response handling.  Allow max timeout for responses. */
+
 	if (MMC_RSP(cmd->flags) == MMC_RSP_NONE)
 		cmdr |= MCI_CMDR_RSPTYP_NO;
 	else {
-		/* Allow big timeout for responses */
 		cmdr |= MCI_CMDR_MAXLAT;
 		if (cmd->flags & MMC_RSP_136)
 			cmdr |= MCI_CMDR_RSPTYP_136;
 		else
 			cmdr |= MCI_CMDR_RSPTYP_48;
 	}
-	if (cmd->opcode == MMC_STOP_TRANSMISSION)
-		cmdr |= MCI_CMDR_TRCMD_STOP;
-	if (sc->host.ios.bus_mode == opendrain)
-		cmdr |= MCI_CMDR_OPDCMD;
-	if (!data) {
-		// The no data case is fairly simple
+
+	/* If there is no data transfer, just set up the right interrupt mask
+	 * and start the command.
+	 *
+	 * The interrupt mask needs to be CMDRDY plus all non-data-transfer
+	 * errors. It's important to leave the transfer-related errors out, to
+	 * avoid spurious timeout or crc errors on a STOP command following a
+	 * multiblock read.  When a multiblock read is in progress, sending a
+	 * STOP in the middle of a block occasionally triggers such errors, but
+	 * we're totally disinterested in them because we've already gotten all
+	 * the data we wanted without error before sending the STOP command.
+	 */
+
+	if (data == NULL) {
+		uint32_t ier = MCI_SR_CMDRDY | 
+                    MCI_SR_RTOE | MCI_SR_RENDE | 
+		    MCI_SR_RCRCE | MCI_SR_RDIRE | MCI_SR_RINDE;
+
 		at91_mci_pdc_disable(sc);
-//		printf("CMDR %x ARGR %x\n", cmdr, cmd->arg);
+
+		if (cmd->opcode == MMC_STOP_TRANSMISSION)
+			cmdr |= MCI_CMDR_TRCMD_STOP;
+
+		/* Ignore response CRC on CMD1 and ACMD41, per standard. */
+
+		if (cmd->opcode == MMC_SEND_OP_COND ||
+		    cmd->opcode == ACMD_SD_SEND_OP_COND)
+			ier &= ~MCI_SR_RCRCE;
+
+		if (mci_debug)
+			printf("CMDR %x (opcode %d) ARGR %x no data\n", 
+			    cmdr, cmd->opcode, cmd->arg);
+
 		WR4(sc, MCI_ARGR, cmd->arg);
 		WR4(sc, MCI_CMDR, cmdr);
-		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_CMDRDY);
+		WR4(sc, MCI_IER, ier);
 		return;
 	}
+
+	/* There is data, set up the transfer-related parts of the command. */
+
 	if (data->flags & MMC_DATA_READ)
 		cmdr |= MCI_CMDR_TRDIR;
+
 	if (data->flags & (MMC_DATA_READ | MMC_DATA_WRITE))
 		cmdr |= MCI_CMDR_TRCMD_START;
+
 	if (data->flags & MMC_DATA_STREAM)
 		cmdr |= MCI_CMDR_TRTYP_STREAM;
-	if (data->flags & MMC_DATA_MULTI)
+	else if (data->flags & MMC_DATA_MULTI) {
 		cmdr |= MCI_CMDR_TRTYP_MULTIPLE;
-	// Set block size and turn on PDC mode for dma xfer and disable
-	// PDC until we're ready.
-	mr = RD4(sc, MCI_MR) & ~MCI_MR_BLKLEN;
-	WR4(sc, MCI_MR, mr | (data->len << 16) | MCI_MR_PDCMODE);
-	WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS);
-	if (cmdr & MCI_CMDR_TRCMD_START) {
-		if (cmdr & MCI_CMDR_TRDIR)
-			vaddr = cmd->data->data;
-		else {
-			/* Use bounce buffer even if we don't need
-			 * byteswap, since buffer may straddle a page
-			 * boundry, and we don't handle multi-segment
-			 * transfers in hardware.
-			 * (page issues seen from 'bsdlabel -w' which
-			 * uses raw geom access to the volume).
-			 * Greg Ansley (gja (at) ansley.com)
-			 */
-			vaddr = sc->bounce_buffer;
-			src = (uint32_t *)cmd->data->data;
-			dst = (uint32_t *)vaddr;
-			if (sc->sc_cap & CAP_NEEDS_BYTESWAP) {
-				for (i = 0; i < data->len / 4; i++)
-					dst[i] = bswap32(src[i]);
-			} else
-				memcpy(dst, src, data->len);
-		}
-		data->xfer_len = 0;
-		if (bus_dmamap_load(sc->dmatag, sc->map, vaddr, data->len,
-		    at91_mci_getaddr, &paddr, 0) != 0) {
-			cmd->error = MMC_ERR_NO_MEMORY;
-			sc->req = NULL;
-			sc->curcmd = NULL;
-			cmd->mrq->done(cmd->mrq);
-			return;
-		}
-		sc->mapped++;
-		if (cmdr & MCI_CMDR_TRDIR) {
+		sc->flags |= (data->flags & MMC_DATA_READ) ? 
+				CMD_MULTIREAD : CMD_MULTIWRITE;
+	}
+
+	/* Disable PDC until we're ready.
+	 *
+	 * Set block size and turn on PDC mode for dma xfer.
+	 * Note that the block size is the smaller of the amount of data to be
+	 * transferred, or 512 bytes.  The 512 size is fixed by the standard;
+	 * smaller blocks are possible, but never larger.
+	 */
+
+	WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS); 
+
+	mr = RD4(sc,MCI_MR) & ~MCI_MR_BLKLEN; 
+	mr |=  min(data->len, 512) << 16; 
+	WR4(sc, MCI_MR, mr | MCI_MR_PDCMODE);
+
+	/* Set up DMA.
+	 *
+	 * Use a bounce buffer even if we don't need to byteswap, because doing
+	 * multi-block IO with a single large DMA buffer is way fast (compared
+	 * to single-block IO), even after incurring the overhead of also
+	 * copying from/to the caller's buffers (which may be in non-contiguous
+	 * physical pages).
+	 *
+	 * In an ideal non-byteswap world we could create a dma tag that allows
+	 * for discontiguous segments and do the IO directly from/to the
+	 * caller's buffer(s), using ENDRX/ENDTX interrupts to chain the
+	 * discontiguous buffers through the PDC. Someday.
+	 *
+	 * XXX what about stream transfers?
+	 */
+
+	if (data->flags & (MMC_DATA_READ | MMC_DATA_WRITE)) {
+		if (data->flags & MMC_DATA_READ) {
 			bus_dmamap_sync(sc->dmatag, sc->map, BUS_DMASYNC_PREREAD);
-			WR4(sc, PDC_RPR, paddr);
+			WR4(sc, PDC_RPR, sc->bbuf_paddr);
 			WR4(sc, PDC_RCR, data->len / 4);
-			ier = MCI_SR_ENDRX;
+			WR4(sc, PDC_PTCR, PDC_PTCR_RXTEN);
 		} else {
+			if (sc->sc_cap & CAP_NEEDS_BYTESWAP) {
+				src = (uint32_t *)data->data;
+				dst = (uint32_t *)sc->bbuf_vaddr;
+				for (i = 0; i < data->len / 4; i++)
+					dst[i] = bswap32(src[i]);
+			} else {
+				bcopy(data->data, sc->bbuf_vaddr, data->len);
+			}
 			bus_dmamap_sync(sc->dmatag, sc->map, BUS_DMASYNC_PREWRITE);
-			WR4(sc, PDC_TPR, paddr);
+			WR4(sc, PDC_TPR, sc->bbuf_paddr);
 			WR4(sc, PDC_TCR, data->len / 4);
-			ier = MCI_SR_TXBUFE;
+			/* do not enable PDC xfer until CMDRDY asserted */
 		}
+		data->xfer_len = 0; /* XXX what's this? appears to be unused. */
 	}
-//	printf("CMDR %x ARGR %x with data\n", cmdr, cmd->arg);
+
+	if (mci_debug)
+		printf("CMDR %x (opcode %d) ARGR %x with data\n", cmdr, cmd->opcode, cmd->arg);
+
 	WR4(sc, MCI_ARGR, cmd->arg);
-	if (cmdr & MCI_CMDR_TRCMD_START) {
-		if (cmdr & MCI_CMDR_TRDIR) {
-			WR4(sc, PDC_PTCR, PDC_PTCR_RXTEN);
-			WR4(sc, MCI_CMDR, cmdr);
-		} else {
-			WR4(sc, MCI_CMDR, cmdr);
-			WR4(sc, PDC_PTCR, PDC_PTCR_TXTEN);
-		}
-	}
-	WR4(sc, MCI_IER, MCI_SR_ERROR | ier);
+	WR4(sc, MCI_CMDR, cmdr);
+	WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_CMDRDY);
 }
 
 static void
-at91_mci_start(struct at91_mci_softc *sc)
+at91_mci_next_operation(struct at91_mci_softc *sc)
 {
 	struct mmc_request *req;
 
 	req = sc->req;
 	if (req == NULL)
 		return;
-	// assert locked
-	if (!(sc->flags & CMD_STARTED)) {
-		sc->flags |= CMD_STARTED;
-//		printf("Starting CMD\n");
-		at91_mci_start_cmd(sc, req->cmd);
-		return;
-	}
-	if (!(sc->flags & STOP_STARTED) && req->stop) {
-//		printf("Starting Stop\n");
-		sc->flags |= STOP_STARTED;
-		at91_mci_start_cmd(sc, req->stop);
-		return;
+
+	if (!(sc->flags & PENDING_ERROR)) {
+		if (sc->flags & PENDING_CMD) {
+			sc->flags &= ~PENDING_CMD;
+			at91_mci_start_cmd(sc, req->cmd);
+			return;
+		} else if (sc->flags & PENDING_STOP) {
+			sc->flags &= ~PENDING_STOP;
+			at91_mci_start_cmd(sc, req->stop);
+			return;
+		}
 	}
-	/* We must be done -- bad idea to do this while locked? */
+
+	WR4(sc, MCI_IDR, 0xffffffff);
 	sc->req = NULL;
 	sc->curcmd = NULL;
+	//printf("req done\n");
 	req->done(req);
 }
 
@@ -486,16 +605,16 @@
 	struct at91_mci_softc *sc = device_get_softc(brdev);
 
 	AT91_MCI_LOCK(sc);
-	// XXX do we want to be able to queue up multiple commands?
-	// XXX sounds like a good idea, but all protocols are sync, so
-	// XXX maybe the idea is naive...
 	if (sc->req != NULL) {
 		AT91_MCI_UNLOCK(sc);
 		return (EBUSY);
 	}
+	//printf("new req\n");
 	sc->req = req;
-	sc->flags = 0;
-	at91_mci_start(sc);
+	sc->flags = PENDING_CMD;
+	if (sc->req->stop)
+		sc->flags |= PENDING_STOP;
+	at91_mci_next_operation(sc);
 	AT91_MCI_UNLOCK(sc);
 	return (0);
 }
@@ -535,118 +654,299 @@
 static void
 at91_mci_read_done(struct at91_mci_softc *sc)
 {
-	uint32_t *walker;
-	struct mmc_command *cmd;
-	int i, len;
+	struct mmc_command *cmd = sc->curcmd;
+
+	/* We arrive here when the entire DMA transfer for a read is done,
+	 * whether it's a single or multi-block read.  Either way the next thing
+	 * to do is move on to the next operation.  For single-block that'll
+	 * mean returning the now-completed request, for multi-block it will
+	 * invoke the stop command sequence.
+	 */
+
+	WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS);
 
-	cmd = sc->curcmd;
 	bus_dmamap_sync(sc->dmatag, sc->map, BUS_DMASYNC_POSTREAD);
-	bus_dmamap_unload(sc->dmatag, sc->map);
-	sc->mapped--;
+
 	if (sc->sc_cap & CAP_NEEDS_BYTESWAP) {
-		walker = (uint32_t *)cmd->data->data;
-		len = cmd->data->len / 4;
+		int i;
+		uint32_t *src = (uint32_t *)sc->bbuf_vaddr;
+		uint32_t *dst = (uint32_t *)cmd->data->data;
+		int len = cmd->data->len / 4;
 		for (i = 0; i < len; i++)
-			walker[i] = bswap32(walker[i]);
+			dst[i] = bswap32(src[i]);
+	} else {
+		bcopy(sc->bbuf_vaddr, cmd->data->data, cmd->data->len);
 	}
-	// Finish up the sequence...
-	WR4(sc, MCI_IDR, MCI_SR_ENDRX);
-	WR4(sc, MCI_IER, MCI_SR_RXBUFF);
-	WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS);
+
+	cmd->error = MMC_ERR_NONE;
+	at91_mci_next_operation(sc);
 }
 
 static void
-at91_mci_xmit_done(struct at91_mci_softc *sc)
+at91_mci_write_done(struct at91_mci_softc *sc, uint32_t sr)
 {
-	// Finish up the sequence...
+	struct mmc_command *cmd = sc->curcmd;
+
+	/* We arrive here when the entire DMA transfer for a write is done,
+	 * whether it's a single or multi-block write.  If it's multi-block we
+	 * have to immediately move on to the next operation which is to send
+	 * the stop command.  If it's a single-block transfer we need to wait
+	 * for NOTBUSY, but if that's already asserted we can avoid another
+	 * interrupt and just move on to completing the request right away.
+	 */
+
 	WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS);
-	WR4(sc, MCI_IDR, MCI_SR_TXBUFE);
-	WR4(sc, MCI_IER, MCI_SR_NOTBUSY);
+
 	bus_dmamap_sync(sc->dmatag, sc->map, BUS_DMASYNC_POSTWRITE);
-	bus_dmamap_unload(sc->dmatag, sc->map);
-	sc->mapped--;
+
+	if ((cmd->data->flags & MMC_DATA_MULTI) || (sr & MCI_SR_NOTBUSY)) {
+                cmd->error = MMC_ERR_NONE;
+                at91_mci_next_operation(sc);
+	} else {
+		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_NOTBUSY);
+	}
+}
+
+static void
+at91_mci_notbusy(struct at91_mci_softc *sc)
+{
+	struct mmc_command *cmd = sc->curcmd;
+
+	/* We arrive here by either completion of a single-block write, or
+	 * completion of the stop command that ended a multi-block write (and, I
+	 * suppose, after a card-select or erase, but I haven't tested those).
+	 * Anyway, we're done and it's time to move on to the next command.
+	 */
+
+	cmd->error = MMC_ERR_NONE;
+	at91_mci_next_operation(sc);
+}
+
+static void
+at91_mci_stop_done(struct at91_mci_softc *sc, uint32_t sr)
+{
+	struct mmc_command *cmd = sc->curcmd;
+
+	/* We arrive here after receiving CMDRDY for a MMC_STOP_TRANSMISSION
+	 * command.  Depending on the operation being stopped, we may have to do
+	 * some unusual things to work around hardware bugs.
+	 */
+
+	/* This is known to be true of at91rm9200 hardware; it may or may not
+	 * apply to more recent chips: 
+	 *
+	 * After stopping a multi-block write, the NOTBUSY bit in MCI_SR does
+	 * not properly reflect the actual busy state of the card as signaled on
+	 * the DAT0 line; it always claims the card is not-busy.  If we believe
+	 * that and let operations continue, following commands will fail with
+	 * response timeouts (except of course MMC_SEND_STATUS -- it indicates
+	 * the card is busy in the PRG state, which was the smoking gun that
+	 * showed MCI_SR NOTBUSY was not tracking DAT0 correctly).
+	 *
+	 * The atmel docs are emphatic: "This flag [NOTBUSY] must be used only
+	 * for Write Operations."  I guess technically since we sent a stop it's
+	 * not a write operation anymore.  But then just what did they think it
+	 * meant for the stop command to have "...an optional busy signal
+	 * transmitted on the data line" according to the SD spec?
+	 *
+	 * I tried a variety of things to un-wedge the MCI and get the status
+	 * register to reflect NOTBUSY correctly again, but the only thing that
+	 * worked was a full device reset.  It feels like an awfully big hammer,
+	 * but doing a full reset after every multiblock write is still faster
+	 * than doing single-block IO (by almost two orders of magnitude:
+	 * 20KB/sec improves to about 1.8MB/sec best case).
+	 *
+	 * After doing the reset, wait for a NOTBUSY interrupt before continuing
+	 * with the next operation.
+	 */
+
+	if (sc->flags & CMD_MULTIWRITE) {
+		at91_mci_reset(sc);
+		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_NOTBUSY);
+		return;
+	}
+
+	/* This is known to be true of at91rm9200 hardware; it may or may not
+	 * apply to more recent chips: 
+	 *
+	 * After stopping a multi-block read, loop to read and discard any data
+	 * that coasts in after we sent the stop command.  The docs don't say
+	 * anything about it, but empirical testing shows that 1-3 additional
+	 * words of data get buffered up in some unmentioned internal fifo and
+	 * if we don't read and discard them here they end up on the front of
+	 * the next read DMA transfer we do.
+	 */
+
+	if (sc->flags & CMD_MULTIREAD) {
+		uint32_t sr;
+		int count = 0;
+		do {
+			sr = RD4(sc, MCI_SR);
+			if (sr & MCI_SR_RXRDY) {
+				RD4(sc,  MCI_RDR);
+				++count;
+			}
+		} while (sr & MCI_SR_RXRDY);
+//              if (count != 0)
+//                      printf("Had to soak up %d words after read\n", count);
+	}
+
+	cmd->error = MMC_ERR_NONE;
+	at91_mci_next_operation(sc);
+
+}
+
+static void
+at91_mci_cmdrdy(struct at91_mci_softc *sc, uint32_t sr)
+{
+	struct mmc_command *cmd = sc->curcmd;
+	int i;
+
+	if (cmd == NULL)
+		return;
+
+	/* We get here at the end of EVERY command.  We retrieve the command
+	 * response (if any) then decide what to do next based on the command.
+	 */
+
+	if (cmd->flags & MMC_RSP_PRESENT) {
+		for (i = 0; i < ((cmd->flags & MMC_RSP_136) ? 4 : 1); i++) {
+			cmd->resp[i] = RD4(sc, MCI_RSPR + i * 4);
+			if (mci_debug)
+				printf("RSPR[%d] = %x sr=%x\n", i, cmd->resp[i],  sr);
+		}
+	}
+
+	/* If this was a stop command, go handle the various special
+	 * conditions (read: bugs) that have to be dealt with following a stop.
+	 */
+
+	if (cmd->opcode == MMC_STOP_TRANSMISSION) {
+		at91_mci_stop_done(sc, sr);
+		return;
+	}
+
+	/* If this command can continue to assert BUSY beyond the response then
+	 * we need to wait for NOTBUSY before the command is really done.
+	 *
+	 * Note that this may not work properly on the at91rm9200.  It certainly
+	 * doesn't work for the STOP command that follows a multi-block write,
+	 * so post-stop CMDRDY is handled separately; see the special handling
+	 * in at91_mci_stop_done().
+	 *
+	 * Beside STOP, there are other R1B-type commands that use the busy
+	 * signal after CMDRDY: CMD7 (card select), CMD28-29 (write protect),
+	 * CMD38 (erase). I haven't tested any of them, but I rather expect
+	 * them all to have the same sort of problem with MCI_SR not actually
+	 * reflecting the state of the DAT0-line busy indicator.  So this code
+	 * may need to grow some sort of special handling for them too. (This
+	 * just in: CMD7 isn't a problem right now because dev/mmc.c incorrectly
+	 * sets the response flags to R1 rather than R1B.)
+	 */
+
+	if ((cmd->flags & MMC_RSP_BUSY)) {
+		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_NOTBUSY);
+		return;
+	}
+
+	/* If there is a data transfer with this command, then...
+	 * - If it's a read, we need to wait for ENDRX.
+	 * - If it's a write, now is the time to enable the PDC and we need to
+	 *   wait for BLKE.
+	 */
+
+	if (cmd->data) {
+		uint32_t ier;
+		if (cmd->data->flags & MMC_DATA_READ) {
+			ier = MCI_SR_ENDRX;
+		} else {
+			ier = MCI_SR_BLKE;
+			WR4(sc, PDC_PTCR, PDC_PTCR_TXTEN);
+		}
+		WR4(sc, MCI_IER, MCI_SR_ERROR | ier);
+		return;
+	}
+
+	/* If we made it to here, we don't need to wait for anything more for
+	 * the current command, move on to the next command (will complete the
+	 * request if there is no next command).
+	 */
+
+	cmd->error = MMC_ERR_NONE;
+	at91_mci_next_operation(sc);
 }
 
 static void
 at91_mci_intr(void *arg)
 {
 	struct at91_mci_softc *sc = (struct at91_mci_softc*)arg;
-	uint32_t sr;
-	int i, done = 0;
-	struct mmc_command *cmd;
+	struct mmc_command *cmd = sc->curcmd;
+	uint32_t sr, isr;
 
 	AT91_MCI_LOCK(sc);
-	sr = RD4(sc, MCI_SR) & RD4(sc, MCI_IMR);
-//	printf("i 0x%x\n", sr);
-	cmd = sc->curcmd;
-	if (sr & MCI_SR_ERROR) {
-		// Ignore CRC errors on CMD2 and ACMD47, per relevant standards
-		if ((sr & MCI_SR_RCRCE) && (cmd->opcode == MMC_SEND_OP_COND ||
-		    cmd->opcode == ACMD_SD_SEND_OP_COND))
-			cmd->error = MMC_ERR_NONE;
-		else if (sr & (MCI_SR_RTOE | MCI_SR_DTOE))
+
+	sr = RD4(sc, MCI_SR);
+	isr = sr & RD4(sc, MCI_IMR);
+
+	if (mci_debug)
+		printf("i 0x%x sr 0x%x\n", isr, sr);
+
+	/* All interrupts are one-shot; disable it now.
+	 * The next operation will re-enable whatever interrupts it wants.
+	 */
+
+	WR4(sc, MCI_IDR, isr);
+
+	if (isr & MCI_SR_ERROR) {
+		if (isr & (MCI_SR_RTOE | MCI_SR_DTOE))
 			cmd->error = MMC_ERR_TIMEOUT;
-		else if (sr & (MCI_SR_RCRCE | MCI_SR_DCRCE))
+		else if (isr & (MCI_SR_RCRCE | MCI_SR_DCRCE))
 			cmd->error = MMC_ERR_BADCRC;
-		else if (sr & (MCI_SR_OVRE | MCI_SR_UNRE))
+		else if (isr & (MCI_SR_OVRE | MCI_SR_UNRE))
 			cmd->error = MMC_ERR_FIFO;
 		else
 			cmd->error = MMC_ERR_FAILED;
-		done = 1;
-		if (sc->mapped && cmd->error) {
-			bus_dmamap_unload(sc->dmatag, sc->map);
-			sc->mapped--;
-		}
+		device_printf(sc->dev, 
+		    "IO error; status MCI_SR = 0x%x cmd opcode = %d\n",  
+		    sr, cmd->opcode);
+		sc->flags |= PENDING_ERROR;
+		at91_mci_reset(sc);
+		at91_mci_next_operation(sc);
 	} else {
-		if (sr & MCI_SR_TXBUFE) {
+		if (isr & MCI_SR_TXBUFE) {
 //			printf("TXBUFE\n");
-			at91_mci_xmit_done(sc);
 		}
-		if (sr & MCI_SR_RXBUFF) {
+		if (isr & MCI_SR_RXBUFF) {
 //			printf("RXBUFF\n");
-			WR4(sc, MCI_IDR, MCI_SR_RXBUFF);
-			WR4(sc, MCI_IER, MCI_SR_CMDRDY);
 		}
-		if (sr & MCI_SR_ENDTX) {
+		if (isr & MCI_SR_ENDTX) {
 //			printf("ENDTX\n");
 		}
-		if (sr & MCI_SR_ENDRX) {
+		if (isr & MCI_SR_ENDRX) {
 //			printf("ENDRX\n");
 			at91_mci_read_done(sc);
 		}
-		if (sr & MCI_SR_NOTBUSY) {
+		if (isr & MCI_SR_NOTBUSY) {
 //			printf("NOTBUSY\n");
-			WR4(sc, MCI_IDR, MCI_SR_NOTBUSY);
-			WR4(sc, MCI_IER, MCI_SR_CMDRDY);
+			at91_mci_notbusy(sc);
 		}
-		if (sr & MCI_SR_DTIP) {
+		if (isr & MCI_SR_DTIP) {
 //			printf("Data transfer in progress\n");
 		}
-		if (sr & MCI_SR_BLKE) {
+		if (isr & MCI_SR_BLKE) {
 //			printf("Block transfer end\n");
+			at91_mci_write_done(sc, sr);
 		}
-		if (sr & MCI_SR_TXRDY) {
+		if (isr & MCI_SR_TXRDY) {
 //			printf("Ready to transmit\n");
 		}
-		if (sr & MCI_SR_RXRDY) {
+		if (isr & MCI_SR_RXRDY) {
 //			printf("Ready to receive\n");
 		}
-		if (sr & MCI_SR_CMDRDY) {
+		if (isr & MCI_SR_CMDRDY) {
 //			printf("Command ready\n");
-			done = 1;
-			cmd->error = MMC_ERR_NONE;
-		}
-	}
-	if (done) {
-		WR4(sc, MCI_IDR, 0xffffffff);
-		if (cmd != NULL && (cmd->flags & MMC_RSP_PRESENT)) {
-			for (i = 0; i < ((cmd->flags & MMC_RSP_136) ? 4 : 1);
-			     i++) {
-				cmd->resp[i] = RD4(sc, MCI_RSPR + i * 4);
-//				printf("RSPR[%d] = %x\n", i, cmd->resp[i]);
-			}
+			at91_mci_cmdrdy(sc, sr);
 		}
-		at91_mci_start(sc);
 	}
 	AT91_MCI_UNLOCK(sc);
 }
@@ -703,7 +1003,7 @@
 		*(int *)result = sc->host.caps;
 		break;
 	case MMCBR_IVAR_MAX_DATA:
-		*(int *)result = 1;
+		*(int *)result = AT91_MCI_MAX_BLOCKS;
 		break;
 	}
 	return (0);
--- patch-arm-at91_mci-for-multiblock ends here ---

>Release-Note:
>Audit-Trail:

From: Bernd Walter <ticso@cicely7.cicely.de>
To: Ian Lepore <freebsd@damnhippie.dyndns.org>
Cc: FreeBSD-gnats-submit@freebsd.org
Subject: Re: arm/155214: [patch] MMC/SD IO slow on Atmel ARM with modern large SD cards
Date: Thu, 3 Mar 2011 00:52:51 +0100

 On Wed, Mar 02, 2011 at 02:53:18PM -0700, Ian Lepore wrote:
 > 
 > >Number:         155214
 > >Category:       arm
 > >Synopsis:       [patch] MMC/SD IO slow on Atmel ARM with modern large SD cards
 > >Confidential:   no
 > >Severity:       serious
 > >Priority:       medium
 > >Responsible:    freebsd-arm
 > >State:          open
 > >Quarter:        
 > >Keywords:       
 > >Date-Required:
 > >Class:          sw-bug
 > >Submitter-Id:   current-users
 > >Arrival-Date:   Wed Mar 02 22:10:10 UTC 2011
 > >Closed-Date:
 > >Last-Modified:
 > >Originator:     Ian Lepore <freebsd@damnhippie.dyndns.org>
 > >Release:        FreeBSD 8.2-RC3 arm
 > >Organization:
 > none
 > >Environment:
 > FreeBSD dvb 8.2-RC3 FreeBSD 8.2-RC3 #49: Tue Feb 15 22:52:14 UTC 2011     root@revolution.hippie.lan:/usr/obj/arm/usr/src/sys/DVB  arm
 > 
 > Included patch is against -current even though the problem was first seen on
 > 8.2-RC3
 > 
 > The problem was seen on AT91RM9200 hardware, but presumably also affects the
 > SAM9 series which uses the same driver code.
 > 
 > >Description:
 > With the latest generation of large-capacity SD cards, write speeds as low as
 > 20 kbytes/sec are seen.  These modern cards have erase-block sizes as large as 
 > 8192K (compared to 32K typical on previous generations).  The at91_mci driver 
 > does only single-sector IO; apparently this requires the SD card to internally 
 > perform an expensive read-erase-modify-write cycle for each 512 byte block 
 > written to the card.
 
 The details of this problem are well known.
 However, the RM9200 has many hardware problems that need to be worked
 around, and so far no one has actually done so.
 Your patch is quite large, so I would like to ask you explicitly:
 Did you test your patch on an AT91RM9200 system?
 Did you enable multisector support for reading and (more importantly)
 for writing?
 But you didn't activate 4-bit mode?
 With 4-bit mode there is no hardware bug, but when the driver was written
 it was just done the lazy way, because activating 4-bit mode on SD cards
 requires special handling.  In the meantime the SD layer itself was split
 out and has 4-bit support, but the at91_mci driver was never updated to
 use it.
 
 PS: I'm very pleased to see your work, since SD write speed has been a
 major showstopper for some applications.
 
 -- 
 B.Walter <bernd@bwct.de> http://www.bwct.de
 Modbus/TCP Ethernet I/O Baugruppen, ARM basierte FreeBSD Rechner uvm.

From: Ian Lepore <freebsd@damnhippie.dyndns.org>
To: ticso@cicely.de
Cc: FreeBSD-gnats-submit@freebsd.org
Subject: Re: arm/155214: [patch] MMC/SD IO slow on Atmel ARM with modern
 large SD cards
Date: Wed, 02 Mar 2011 17:21:09 -0700

 On Thu, 2011-03-03 at 00:52 +0100, Bernd Walter wrote:
 > The details of this problem are well known.
 > However, the RM9200 has many hardware problems that need to be worked
 > around, and so far no one has actually done so.
 > Your patch is quite large, so I would like to ask you explicitly:
 > Did you test your patch on an AT91RM9200 system?
 > Did you enable multisector support for reading and (more importantly)
 > for writing?
 > But you didn't activate 4-bit mode?
 > With 4-bit mode there is no hardware bug, but when the driver was written
 > it was just done the lazy way, because activating 4-bit mode on SD cards
 > requires special handling.  In the meantime the SD layer itself was split
 > out and has 4-bit support, but the at91_mci driver was never updated to
 > use it.
 > 
 > PS: I'm very pleased to see your work, since SD write speed has been a
 > major showstopper for some applications.
 > 
 
 Yes, the patch is large, partly because I included comments about the
 hardware problems I found and how the code works around them (and also
 to help the next person understand the flow).
 
 My changes support multi-sector IO for both reads and writes.
 
 The company I work for uses the AT91RM9200 on custom-designed boards in
 8 products, all with substantially similar board designs.  So far we've
 tested these changes on 4 of them, with no problems found.
 
 I have not tested with 4-bit enabled; I wasn't aware (but in retrospect
 I probably should have assumed) that the hardware bugs are different
 with 4-bit enabled.  I'm not even sure our hardware design carries all 4
 lines to the card; I'll look at the schematics and if they're connected
 I'll see about testing that mode.  (And if they're not I'll see about
 having our designers wire up all 4 lines on future designs.)
 
 I also haven't tested with the SAM9-series, because I don't have that
 hardware available.  (I hope to convince our hardware designers to
 migrate us to SAM9 this year.)
 
 

From: Ian Lepore <freebsd@damnhippie.dyndns.org>
To: ticso@cicely.de
Cc: FreeBSD-gnats-submit@freebsd.org
Subject: Re: arm/155214: [patch] MMC/SD IO slow on Atmel ARM with modern
 large SD cards
Date: Fri, 04 Mar 2011 13:10:12 -0700

 
 I made some time today to try 4-bit mode in the mci driver, using 
 8.2-RELEASE as a testbed.  I quickly determined that just enabling 
 4-bit mode results in corrupted read data severe enough to virtually 
 always cause "root mount error" at boot.  Occasionally it'll manage to 
 mount root but then lock up or panic during rc-file processing.  It 
 does this both with the original driver and with my patched driver 
 configured for single-block or multi-block operation.  
 
 After some experimenting to find the cause of the corrupted data, I
 realized we're violating the SD spec by running the bus at 30 MHz --
 the spec says 25 MHz max until you use CMD6 to switch to high-speed
 mode, if the card supports it.  Our next lower available speed is
 15 MHz, and when I set that as the max speed, 4-bit works perfectly,
 both in the original driver and with my patches in single- or
 multi-block operation.  (In my patched driver I had to add a
 controller reset following a multi-block read stop, similar to the one
 after a multi-block write, to avoid occasional spurious data CRC errors
 in 4-bit mode.  The data we want is read correctly; the CRC error
 happens on the block that's still coming in as the stop command is
 being issued.  I'm not sure why this only happens in 4-bit mode.)
 
 Since we've been getting away with 30 MHz/1-bit for years, I surmise
 that any card that is capable of delivering 25 MHz/4-bit is also
 capable of doing 30 MHz/1-bit, even though that's a slight violation of
 the spec.  But 30 MHz/4-bit appears to be enough of a violation that
 even modern cards don't keep up.  (When looking at dumps of the
 corrupted read data, an old card had a lot of corruption: about 20% of
 the data was read wrong.  A modern card had just a few bits wrong out
 of every few kbytes read.)
 
 Since 15 MHz/4-bit is still twice the raw data throughput of
 30 MHz/1-bit, I decided to do some crude benchmarking to see if it's
 worth the trouble of making 4-bit work correctly.  The results appear
 below.  In summary, there is definitely a benefit to using 4-bit
 transfers, but the improvement isn't nearly as dramatic as the change
 from single- to multi-block IO.
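 
 The "twice the data throughput" figure above is raw bus arithmetic: one
 bit per data line per clock cycle, ignoring command, CRC, and busy
 overhead.  A minimal sketch of that calculation follows; the function
 name is hypothetical and is not part of the driver.

```c
#include <stdint.h>

/*
 * Raw (theoretical) data-line bandwidth in bytes per second.
 * Assumption: one bit moves on each data line per clock cycle and all
 * protocol overhead (commands, CRC, busy waits) is ignored.
 */
uint32_t
raw_bus_bytes_per_sec(uint32_t clock_hz, uint32_t bus_width_bits)
{
	/* Use 64-bit math so clock_hz * width cannot overflow. */
	return ((uint32_t)(((uint64_t)clock_hz * bus_width_bits) / 8));
}
```

 With these inputs, 30 MHz/1-bit gives 3,750,000 bytes/sec and
 15 MHz/4-bit gives 7,500,000 bytes/sec.  The measured numbers are well
 below both limits, so the bus width is only one of several factors.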
 
 Supporting 4-bit transfers properly will require some changes in
 dev/mmc.  It doesn't currently use CMD6 to switch to high-speed mode
 at all.  I'm assuming that if we update it to do so, we'll have no
 problem running at 30 MHz/4-bit.  There'll also need to be some fixes
 in the routine that calculates the speed to run at, because right now
 it doesn't account for the 25 MHz speed limit set by the spec before
 switching to high-speed mode (which is why we end up running at 30 MHz).
 
 The mci driver will also need some updates to round the clock speed
 requested by the upper layers down to the next lower supported speed,
 but it would probably be good to have a bit of a hack in there as well
 to allow 30 MHz operation in 1-bit mode, since folks have come to
 expect that and it seems to work ok.
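 
 The rounding described above could look roughly like the sketch below.
 It assumes the bus clock is MCK / (2 * (CLKDIV + 1)) with an 8-bit
 CLKDIV field; the function names and the allow_30mhz parameter are
 hypothetical, and this is not the driver's actual code.

```c
#include <stdint.h>

/*
 * Pick the smallest CLKDIV whose resulting bus clock does not exceed
 * the requested speed.  Assumes bus_hz = mck_hz / (2 * (clkdiv + 1))
 * and an 8-bit CLKDIV field (0..255).
 */
uint32_t
mci_pick_clkdiv(uint32_t mck_hz, uint32_t req_hz, int allow_30mhz)
{
	uint64_t div;

	if (req_hz == 0)
		return (0);	/* the caller is expected to disable the bus */

	/* Optional hack: keep the historical 30 MHz bus on a 60 MHz MCK. */
	if (allow_30mhz && mck_hz == 60000000u && req_hz == 25000000u)
		return (0);

	/* Ceiling division: smallest div with mck / (2 * div) <= req. */
	div = ((uint64_t)mck_hz + 2ull * req_hz - 1) / (2ull * req_hz);
	if (div < 1)
		div = 1;
	if (div > 256)
		div = 256;
	return ((uint32_t)(div - 1));
}

/* Actual bus clock produced by a given CLKDIV value. */
uint32_t
mci_bus_hz(uint32_t mck_hz, uint32_t clkdiv)
{
	return (mck_hz / (2 * (clkdiv + 1)));
}
```

 With a 60 MHz master clock, a 25 MHz request yields CLKDIV 1 (15 MHz),
 or CLKDIV 0 (30 MHz) when the override is enabled, and a 400 kHz
 request yields CLKDIV 74.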
 
 About the benchmarks...
 
 I tested with two different cards, noted below by their erase block
 sizes.  The card with the 32-block erase size is a SanDisk 512 MB card
 from several years ago.  The card with the 8192-block erase size is a
 SanDisk 2 GB card purchased recently.  The older card does not claim to
 support high-speed mode; the newer card does (but of course we don't
 switch the card to high-speed mode).
 
 I tested each card with each combo of bus speed, bus width, and 
 single- versus multi-block IO.  All of the results below are with my 
 patched driver.  I also briefly tested the original unpatched 8.2 
 driver and found the results very much in line with the 1-block 
 results from my patched driver.  (The patched driver performs a little 
 better even in single-block mode, probably because it gets the same 
 work done with fewer interrupts.) 
 
 Read and write speeds are as reported by these commands:
 
   dd if=/dev/mmcsd0s2a of=/dev/null bs=1m count=10
   dd if=/dev/zero of=/dev/mmcsd0s2a bs=1m count=10
 
 Each test was run several times immediately after rebooting; median 
 values reported.  There were no writable filesystems mounted and 
 relatively little going on in the system in general, but I didn't get 
 fanatical about leveling the test conditions.  
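 
 Reducing several runs of the same dd command to a median can be
 sketched as follows.  The helper names are hypothetical, and any
 sample values fed to it would be the per-run bytes/sec figures that
 dd prints.

```c
#include <stdint.h>
#include <stdlib.h>

/* qsort() comparator for uint32_t values, ascending. */
static int
cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a;
	uint32_t y = *(const uint32_t *)b;

	return ((x > y) - (x < y));
}

/*
 * Median of n throughput samples.  Sorts the array in place.  For an
 * even count, returns the mean of the two middle values (rounded down).
 */
uint32_t
median_u32(uint32_t *samples, size_t n)
{
	if (n == 0)
		return (0);
	qsort(samples, n, sizeof(samples[0]), cmp_u32);
	if (n % 2 == 1)
		return (samples[n / 2]);
	return ((uint32_t)(((uint64_t)samples[n / 2 - 1] +
	    samples[n / 2]) / 2));
}
```

 Using the median rather than the mean keeps a single unusually slow or
 fast run from skewing the reported figure.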
 
 Erase/clock/bus/xfer size    Read bytes/sec   Write bytes/sec
 
   32/30MHz/1bit/1-block          864452          333324 
   32/15MHz/4bit/1-block          975780          346738 
 
 8192/30MHz/1bit/1-block          647241           24211 
 8192/15MHz/4bit/1-block          722659           24253 
 
   32/30MHz/1bit/64-block        2192806         1775660 
   32/15MHz/4bit/64-block        3075302         1775302 
 
 8192/30MHz/1bit/64-block        2133880         1503959 
 8192/15MHz/4bit/64-block        2947133         1753540 
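 
 To compare rows of the table, the ratio between two throughput
 figures is enough.  A minimal sketch (the function name is
 hypothetical):

```c
#include <stdint.h>

/*
 * Ratio of two throughput measurements in bytes/sec.
 * A result above 1.0 means "improved" is faster than "baseline".
 */
double
throughput_ratio(uint32_t baseline, uint32_t improved)
{
	if (baseline == 0)
		return (0.0);
	return ((double)improved / (double)baseline);
}
```

 For example, on the 8192-block card at 30 MHz/1-bit, writes go from
 24211 to 1503959 bytes/sec, a ratio of roughly 62.  On the 32-block
 card at 64-block transfers, moving from 30 MHz/1-bit to 15 MHz/4-bit
 improves reads by a ratio of roughly 1.4.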
 
 
 Another crude little benchmark...  right after booting I logged on as
 root immediately and did a vmstat -i, so this should roughly represent
 how many interrupts it took to get booted and launch root's shell (all
 read IO, there are no writable filesystems mounted, both done at
 30 MHz/1-bit):
 
 vmstat -i                  interrupt         total       rate
 original driver (1-block)  irq10: at91_mci0  42384       1284
 patched  driver (64-block) irq10: at91_mci0   1365         52
 
 
 Based on the benchmark results, and the fact that I don't really have
 the time to take on the dev/mmc changes right now, I think we should
 adopt the multi-block patches and stick with 30 MHz/1-bit for now.
 Maybe I can find some time later this year to get dev/mmc working
 better with high-speed mode (without accidentally breaking the sdhci
 world, which I don't know enough about right now).
 
 

From: Ian Lepore <freebsd@damnhippie.dyndns.org>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: arm/155214: [patch] MMC/SD IO slow on Atmel ARM with modern
 large SD cards
Date: Mon, 18 Apr 2011 12:45:28 -0600

 I have an updated patch for this which includes better error handling
 and better read performance (mainly by splitting large IO requests into
 two DMA operations and doing the byte-swapping for the first half while
 the second half is still on the wire from the card).  It also has more
 comments about what works and what doesn't (e.g., 30 MHz 4-bit transfers
 when USB Host mode is also enabled).
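 
 The split-and-swap idea can be sketched in plain C as below.  This is
 only an illustration under stated assumptions: the function names are
 hypothetical, the split is assumed to fall on a 512-byte block
 boundary, and the real overlap depends on the second DMA running in
 hardware while the CPU swaps the first half.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable 32-bit byte swap (stand-in for the kernel's bswap32()). */
uint32_t
swap32(uint32_t v)
{
	return ((v >> 24) | ((v >> 8) & 0x0000ff00u) |
	    ((v << 8) & 0x00ff0000u) | (v << 24));
}

/*
 * Copy nbytes from src to dst, byte-swapping each 32-bit word.
 * Assumes nbytes is a multiple of 4; trailing bytes are ignored.
 */
void
swap_buf(uint32_t *dst, const uint32_t *src, size_t nbytes)
{
	size_t i, nwords = nbytes / 4;

	for (i = 0; i < nwords; i++)
		dst[i] = swap32(src[i]);
}

/*
 * Split a transfer into two DMA operations on a 512-byte boundary.
 * Transfers shorter than two blocks are not split.
 */
void
split_transfer(size_t total, size_t *first, size_t *second)
{
	size_t blocks = total / 512;

	if (blocks < 2) {
		*first = total;
		*second = 0;
		return;
	}
	*first = ((blocks + 1) / 2) * 512;
	*second = total - *first;
}
```

 In a driver, the first length would be queued on one bounce buffer and
 the second on the other, with swap_buf() run on the first buffer as
 soon as its DMA completes.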
 
 I don't see any straightforward way on the PR page to nuke the original
 patch and supply a replacement.  What's the best way to handle that?
 
 -- Ian
 
 

From: Ian Lepore <freebsd@damnhippie.dyndns.org>
To: bug-followup@FreeBSD.org, freebsd@damnhippie.dyndns.org
Cc:  
Subject: Re: arm/155214: [patch] MMC/SD IO slow on Atmel ARM with modern
 large SD cards (updated patch)
Date: Tue, 19 Apr 2011 12:06:43 -0600

 Index: sys/arm/at91/at91_mci.c
 ===================================================================
 --- sys/arm/at91/at91_mci.c.cvs_v1.18   2011-03-30 08:30:18.000000000 -0600
 +++ sys/arm/at91/at91_mci.c             2011-04-18 12:04:55.000000000 -0600
 @@ -67,7 +67,60 @@
  
  #include "opt_at91.h"
  
 -#define BBSZ	512
 +/* About running the MCI bus at 30mhz...
 + *
 + * Historically, the MCI bus has been run at 30mhz on systems with a 60mhz
 + * master clock, due to a bug in the mantissa table in dev/mmc.c making it
 + * appear that the card's max speed was always 30mhz.  Fixing that bug causes
 + * the mmc driver to request a 25mhz clock (as it should) and the logic in
 + * at91_mci_update_ios() picks the highest speed that doesn't exceed that limit.
 + * With a 60mhz MCK that would be 15mhz, and that's a real performance buzzkill
 + * when you've been getting away with 30mhz all along.
 + *
 + * By defining AT91_MCI_USE_30MHZ (or setting the 30mhz=1 device hint or sysctl)
 + * you can enable logic in at91_mci_update_ios() to set the mci bus to 30mhz
 + * when MCK is 60mhz and the requested speed is 25mhz.  This appears to work on
 + * virtually all SD cards, since it is what this driver has been doing by
 + * accident since day one.  I've seen modern SD cards run at 45mhz/1-bit in
 + * standard mode (high speed mode enable commands not sent) without problems.
 + *
 + * Speaking of high-speed mode, the rm9200 manual says the MCI device supports
 + * the SD v1.0 specification and can run up to 50mhz.  This is interesting in
 + * that the SD v1.0 spec caps the speed at 25mhz; high speed mode was added in
 + * the v1.10 spec.  Furthermore, high speed mode doesn't just crank up the
 + * clock, it alters the signal timing.  The rm9200 MCI device doesn't support
 + * these altered timings.  So while speeds over 25mhz may work, they only work
 + * in what the SD spec calls "default" speed mode, and it amounts to violating
 + * the spec by overclocking the bus.
 + *
 + * If you also enable 4-wire mode it's possible the 30mhz transfers will fail.
 + * If you have the USB host device and OHCI driver enabled it's guaranteed to
 + * fail (I get intermittent overrun and underrun errors even at 15mhz 4-wire
 + * with OHCI). Note that you don't even need to have usb devices attached to
 + * the system; the errors begin to occur as soon as the OHCI driver sets the
 + * register bit to enable periodic transfers.  It appears (based on brief
 + * investigation) that the usb host controller uses so much ASB bandwidth that
 + * sometimes the DMA for MCI transfers doesn't get a bus grant in time and data
 + * gets dropped.  Adding even a modicum of network activity changes the symptom
 + * from intermittent to very frequent.
 + */
 +
 +#ifndef AT91_MCI_USE_30MHZ
 +#define AT91_MCI_USE_30MHZ 1
 +#endif
 +
 +/* Allocate 2 bounce buffers, each being sized to half the system default
 + * physical IO size.  That enables doing DFLTPHYS sized transfers at a time,
 + * with read transfers in particular being split into two operations so that we
 + * can overlap some of the byte-swapping needed due to the rm9200 erratum with
 + * the DMA for the second half of the transfer.
 + */
 +
 +#define BBCOUNT     2
 +#define BBSIZE      (DFLTPHYS/BBCOUNT)
 +#define MAX_BLOCKS  ((BBSIZE*BBCOUNT)/512)
 +
 +static int mci_debug;
  
  struct at91_mci_softc {
  	void *intrhand;			/* Interrupt handle */
 @@ -75,21 +128,26 @@
  	int sc_cap;
  #define	CAP_HAS_4WIRE		1	/* Has 4 wire bus */
  #define	CAP_NEEDS_BYTESWAP	2	/* broken hardware needing bounce */
 -	int flags;
  	int has_4wire;
 -#define CMD_STARTED	1
 -#define STOP_STARTED	2
 +	int use_30mhz;
 +	int flags;
 +#define PENDING_CMD	0x01
 +#define PENDING_STOP	0x02
 +#define CMD_MULTIREAD	0x10
 +#define CMD_MULTIWRITE	0x20
  	struct resource *irq_res;	/* IRQ resource */
  	struct resource	*mem_res;	/* Memory resource */
  	struct mtx sc_mtx;
  	bus_dma_tag_t dmatag;
 -	bus_dmamap_t map;
 -	int mapped;
  	struct mmc_host host;
  	int bus_busy;
  	struct mmc_request *req;
  	struct mmc_command *curcmd;
 -	char bounce_buffer[BBSZ];
 +	bus_dmamap_t bbuf_map[BBCOUNT];
 +	char      *  bbuf_vaddr[BBCOUNT]; /* bounce bufs in KVA space */
 +	uint32_t     bbuf_len[BBCOUNT];	  /* len currently queued for bounce buf */
 +	uint32_t     bbuf_curidx;	  /* which bbuf is the active DMA buffer */
 +	uint32_t     xfer_offset;	  /* offset so far into caller's buf */
  };
  
  static inline uint32_t
 @@ -123,6 +181,47 @@
  #define AT91_MCI_ASSERT_LOCKED(_sc)	mtx_assert(&_sc->sc_mtx, MA_OWNED);
  #define AT91_MCI_ASSERT_UNLOCKED(_sc) mtx_assert(&_sc->sc_mtx, MA_NOTOWNED);
  
 +static void 
 +at91_bswap_buf(struct at91_mci_softc *sc, void * dptr, void * sptr, uint32_t memsize)
 +{
 +	uint32_t * dst = (uint32_t *)dptr;
 +	uint32_t * src = (uint32_t *)sptr;
 +	uint32_t   i;
 +
 +	/* If the hardware doesn't need byte-swapping, let bcopy() do the work.
 +	 */
 +
 +	if (!(sc->sc_cap & CAP_NEEDS_BYTESWAP)) {
 +		bcopy(sptr, dptr, memsize);
 +		return;
 +	}
 +
 +	/* Nice performance boost for slightly unrolling this loop.
 +	 * (But very little extra boost for further unrolling it.)
 +	 */
 +
 +	for (i = 0; i + 16 <= memsize; i += 16) {
 +		*dst++ = bswap32(*src++);
 +		*dst++ = bswap32(*src++);
 +		*dst++ = bswap32(*src++);
 +		*dst++ = bswap32(*src++);
 +	}
 +
 +	/* Mop up the last 1-3 words, if any. */
 +
 +	for (i = 0; i < (memsize & 0x0F); i += 4) {
 +		*dst++ = bswap32(*src++);
 +	}
 +}
 +
 +static void
 +at91_mci_getaddr(void *arg, bus_dma_segment_t *segs, int nsegs, int error)
 +{
 +	if (error != 0)
 +		return;
 +	*(bus_addr_t *)arg = segs[0].ds_addr;
 +}
 +
  static void
  at91_mci_pdc_disable(struct at91_mci_softc *sc)
  {
 @@ -137,15 +236,57 @@
  	WR4(sc, PDC_TNCR, 0);
  }
  
 +/* Reset the controller, then restore most of the current state.
 + *
 + * This is called after detecting an error.  It's also called after stopping a
 + * multi-block write, to un-wedge the device so that it will handle the NOTBUSY
 + * signal correctly.  See comments in at91_mci_stop_done() for more details.
 + */
 +static void at91_mci_reset(struct at91_mci_softc *sc)
 +{
 +	uint32_t mr;
 +	uint32_t sdcr;
 +	uint32_t dtor;
 +	uint32_t imr;
 +
 +	at91_mci_pdc_disable(sc);
 +
 +	/* save current state */
 +
 +	imr  = RD4(sc, MCI_IMR);
 +	mr   = RD4(sc, MCI_MR) & 0x7fff;
 +	sdcr = RD4(sc, MCI_SDCR);
 +	dtor = RD4(sc, MCI_DTOR);
 +
 +	/* reset the controller */
 +
 +	WR4(sc, MCI_IDR, 0xffffffff);
 +	WR4(sc, MCI_CR, MCI_CR_MCIDIS | MCI_CR_SWRST);
 +
 +	/* restore state */
 +
 +	WR4(sc, MCI_CR, MCI_CR_MCIEN|MCI_CR_PWSEN);
 +	WR4(sc, MCI_MR, mr);
 +	WR4(sc, MCI_SDCR, sdcr);
 +	WR4(sc, MCI_DTOR, dtor);
 +	WR4(sc, MCI_IER, imr);
 +
 +	/* Make sure sdio interrupts will fire.  Not sure why reading
 +	 * SR ensures that, but this is in the linux driver.
 +	 */
 +
 +	RD4(sc, MCI_SR);
 +}
 +
  static void
  at91_mci_init(device_t dev)
  {
  	struct at91_mci_softc *sc = device_get_softc(dev);
  
 -	WR4(sc, MCI_CR, MCI_CR_MCIEN);		/* Enable controller */
 +	WR4(sc, MCI_CR, MCI_CR_MCIDIS | MCI_CR_SWRST); /* Put the device into reset */
  	WR4(sc, MCI_IDR, 0xffffffff);		/* Turn off interrupts */
  	WR4(sc, MCI_DTOR, MCI_DTOR_DTOMUL_1M | 1);
 -	WR4(sc, MCI_MR, 0x834a);	// XXX GROSS HACK FROM LINUX
 +	WR4(sc, MCI_MR, 0x834a); // set PDCMODE, PWSDIV=3, CLKDIV=74
  #ifndef  AT91_MCI_SLOT_B
  	WR4(sc, MCI_SDCR, 0);			/* SLOT A, 1 bit bus */
  #else
 @@ -153,6 +294,11 @@
  	 * a two slot card that we know of. XXX */
  	WR4(sc, MCI_SDCR, 1);			/* SLOT B, 1 bit bus */
  #endif
 +	/* Enable controller, including power-save.  The slower clock of
 +	 * the power-save mode is only in effect when there is no transfer in
 +	 * progress, so it can be left in this mode all the time.
 +	 */
 +	WR4(sc, MCI_CR, MCI_CR_MCIEN|MCI_CR_PWSEN);
  }
  
  static void
 @@ -167,8 +313,8 @@
  
  static int
  at91_mci_probe(device_t dev)
 -{
  
 +{
  	device_set_desc(dev, "MCI mmc/sd host bridge");
  	return (0);
  }
 @@ -180,6 +326,7 @@
  	struct sysctl_ctx_list *sctx;
  	struct sysctl_oid *soid;
  	device_t child;
 +	int i;
  	int err;
  
  	sc->dev = dev;
 @@ -193,48 +340,100 @@
  
  	AT91_MCI_LOCK_INIT(sc);
  
 -	/*
 -	 * Allocate DMA tags and maps
 +	at91_mci_fini(dev);
 +	at91_mci_init(dev);
 +
 +	/* Allocate DMA tags and maps and bounce buffers.
 +	 *
 +	 * The parms in the tag_create call cause the dmamem_alloc call to
 +	 * create each bounce buffer as a single contiguous buffer of BBSIZE
 +	 * bytes aligned to a 4096 byte boundary.
 +	 *
 +	 * Do not use DMA_COHERENT for these buffers because that maps the
 +	 * memory as non-cachable, which prevents cache line burst fills/writes,
 +	 * which is something we need since we're trying to overlap the
 +	 * byte-swapping with the DMA operations.
  	 */
 -	err = bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0,
 -	    BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR, NULL, NULL, MAXPHYS, 1,
 -	    MAXPHYS, BUS_DMA_ALLOCNOW, NULL, NULL, &sc->dmatag);
 -	if (err != 0)
 -		goto out;
  
 -	err = bus_dmamap_create(sc->dmatag, 0,  &sc->map);
 +	err = bus_dma_tag_create(bus_get_dma_tag(dev), 4096, 0,
 +	    BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR, NULL, NULL, 
 +	    BBSIZE, 1, MAXPHYS, 0, NULL, NULL, &sc->dmatag);
  	if (err != 0)
  		goto out;
  
 -	at91_mci_fini(dev);
 -	at91_mci_init(dev);
 +	for (i = 0; i < BBCOUNT; ++i) {
 +		err = bus_dmamem_alloc(sc->dmatag, (void **)&sc->bbuf_vaddr[i],
 +		    BUS_DMA_NOWAIT, &sc->bbuf_map[i]);
 +		if (err != 0)
 +			goto out;
 +	}
  
  	/*
  	 * Activate the interrupt
  	 */
 -	err = bus_setup_intr(dev, sc->irq_res, INTR_TYPE_MISC | INTR_MPSAFE,
 +	err = bus_setup_intr(dev, sc->irq_res, INTR_TYPE_BIO | INTR_MPSAFE,
  	    NULL, at91_mci_intr, sc, &sc->intrhand);
  	if (err) {
  		AT91_MCI_LOCK_DESTROY(sc);
  		goto out;
  	}
  
 +	/* Allow 4-wire to be initially set via #define.
 +	 * Allow a device hint to override that.
 +	 * Allow a sysctl to override that.
 +	 */
 +
 +#if defined(AT91_MCI_HAS_4WIRE) && AT91_MCI_HAS_4WIRE != 0
 +	sc->has_4wire = 1;
 +#else
 +	sc->has_4wire = 0;
 +#endif
 +
 +	resource_int_value(device_get_name(dev), device_get_unit(dev), 
 +			   "4wire", &sc->has_4wire);
 +
  	sctx = device_get_sysctl_ctx(dev);
  	soid = device_get_sysctl_tree(dev);
  	SYSCTL_ADD_UINT(sctx, SYSCTL_CHILDREN(soid), OID_AUTO, "4wire",
  	    CTLFLAG_RW, &sc->has_4wire, 0, "has 4 wire SD Card bus");
  
 -#ifdef AT91_MCI_HAS_4WIRE
 -	sc->has_4wire = 1;
 -#endif
  	if (sc->has_4wire)
  		sc->sc_cap |= CAP_HAS_4WIRE;
  
 -	sc->host.f_min = at91_master_clock / 512;
 +	/* Allow use_30mhz to be initially set via #define.
 +	 * Allow a device hint to override that.
 +	 * Allow a sysctl to override that.
 +	 */
 +
 +#if defined(AT91_MCI_USE_30MHZ) && AT91_MCI_USE_30MHZ != 0
 +	sc->use_30mhz = 1;
 +#else
 +	sc->use_30mhz = 0;
 +#endif
 +
 +	resource_int_value(device_get_name(dev), device_get_unit(dev), 
 +			   "30mhz", &sc->use_30mhz);
 +
 +	sctx = device_get_sysctl_ctx(dev);
 +	soid = device_get_sysctl_tree(dev);
 +	SYSCTL_ADD_UINT(sctx, SYSCTL_CHILDREN(soid), OID_AUTO, "30mhz",
 +	    CTLFLAG_RW, &sc->use_30mhz, 0, "use 30mhz clock for 25mhz request");
 +
 +	/* Our real min freq is master_clock/512, but upper driver layers are
 +	 * going to set the min speed during card discovery, and the right speed
 +	 * for that is 400khz, so advertise a safe value just under that.
 +	 *
 +	 * For max speed, while the rm9200 manual says the max is 50mhz, it also
 +	 * says it supports only the SD v1.0 spec, which means the real limit is
 +	 * 25mhz. On the other hand, historical use has been to slightly violate
 +	 * the standard by running the bus at 30mhz.  For more information on
 +	 * that, see the comments at the top of this file.
 +	 */
 +
  	sc->host.f_min = 375000;
  	sc->host.f_max = at91_master_clock / 2;
 -	if (sc->host.f_max > 50000000)	
 -		sc->host.f_max = 50000000;	/* Limit to 50MHz */
 +	if (sc->host.f_max > 25000000)
 +		sc->host.f_max = 25000000;
  
  	sc->host.host_ocr = MMC_OCR_320_330 | MMC_OCR_330_340;
  	sc->host.caps = 0;
 @@ -252,8 +451,15 @@
  static int
  at91_mci_detach(device_t dev)
  {
 +	struct at91_mci_softc *sc = device_get_softc(dev);
 +
  	at91_mci_fini(dev);
  	at91_mci_deactivate(dev);
 +
 +	bus_dmamem_free(sc->dmatag, sc->bbuf_vaddr[0], sc->bbuf_map[0]);
 +	bus_dmamem_free(sc->dmatag, sc->bbuf_vaddr[1], sc->bbuf_map[1]);
 +	bus_dma_tag_destroy(sc->dmatag);
 +
  	return (EBUSY);	/* XXX */
  }
  
 @@ -293,7 +499,7 @@
  	sc->intrhand = 0;
  	bus_generic_detach(sc->dev);
  	if (sc->mem_res)
 -		bus_release_resource(dev, SYS_RES_IOPORT,
 +		bus_release_resource(dev, SYS_RES_MEMORY,
  		    rman_get_rid(sc->mem_res), sc->mem_res);
  	sc->mem_res = 0;
  	if (sc->irq_res)
 @@ -303,14 +509,6 @@
  	return;
  }
  
 -static void
 -at91_mci_getaddr(void *arg, bus_dma_segment_t *segs, int nsegs, int error)
 -{
 -	if (error != 0)
 -		return;
 -	*(bus_addr_t *)arg = segs[0].ds_addr;
 -}
 -
  static int
  at91_mci_update_ios(device_t brdev, device_t reqdev)
  {
 @@ -322,16 +520,31 @@
  	sc = device_get_softc(brdev);
  	host = &sc->host;
  	ios = &host->ios;
 -	// bus mode?
 +
 +	/* Calculate our closest available clock speed that doesn't exceed the
 +	 * requested speed.
 +	 *
 +	 * If the master clock is running at 60mhz and the requested bus speed
 +	 * is 25mhz and the use_30mhz flag is on, set clkdiv to zero to get a
 +	 * 30mhz mci clock. See comments near the top of the file for more info.
 +	 *
 +	 * Whatever we come up with, store it back into ios->clock so that the
 +	 * upper layer drivers can report the actual speed of the bus.
 +	 */
 +
  	if (ios->clock == 0) {
  		WR4(sc, MCI_CR, MCI_CR_MCIDIS);
  		clkdiv = 0;
  	} else {
 -		WR4(sc, MCI_CR, MCI_CR_MCIEN);
 -		if ((at91_master_clock % (ios->clock * 2)) == 0)
 +		WR4(sc, MCI_CR, MCI_CR_MCIEN|MCI_CR_PWSEN);
 +		if (sc->use_30mhz && at91_master_clock == 60000000 && 
 +		    ios->clock == 25000000)
 +			clkdiv = 0;
 +		else if ((at91_master_clock % (ios->clock * 2)) == 0)
  			clkdiv = ((at91_master_clock / ios->clock) / 2) - 1;
  		else
  			clkdiv = (at91_master_clock / ios->clock) / 2;
 +		ios->clock = at91_master_clock / ((clkdiv+1) * 2);
  	}
  	if (ios->bus_width == bus_width_4)
  		WR4(sc, MCI_SDCR, RD4(sc, MCI_SDCR) | MCI_SDCR_SDCBUS);
 @@ -346,137 +559,247 @@
  static void
  at91_mci_start_cmd(struct at91_mci_softc *sc, struct mmc_command *cmd)
  {
 -	uint32_t cmdr, ier = 0, mr;
 -	uint32_t *src, *dst;
 -	int i;
 +	uint32_t cmdr, mr;
  	struct mmc_data *data;
 -	void *vaddr;
 -	bus_addr_t paddr;
  
  	sc->curcmd = cmd;
  	data = cmd->data;
 -	cmdr = cmd->opcode;
  
  	/* XXX Upper layers don't always set this */
  	cmd->mrq = sc->req;
  
 +	/* Begin setting up command register. */
 +
 +	cmdr = cmd->opcode;
 +
 +	if (sc->host.ios.bus_mode == opendrain)
 +		cmdr |= MCI_CMDR_OPDCMD;
 +
 +	/* Set up response handling.  Allow max timeout for responses. */
 +
  	if (MMC_RSP(cmd->flags) == MMC_RSP_NONE)
  		cmdr |= MCI_CMDR_RSPTYP_NO;
  	else {
 -		/* Allow big timeout for responses */
  		cmdr |= MCI_CMDR_MAXLAT;
  		if (cmd->flags & MMC_RSP_136)
  			cmdr |= MCI_CMDR_RSPTYP_136;
  		else
  			cmdr |= MCI_CMDR_RSPTYP_48;
  	}
 -	if (cmd->opcode == MMC_STOP_TRANSMISSION)
 -		cmdr |= MCI_CMDR_TRCMD_STOP;
 -	if (sc->host.ios.bus_mode == opendrain)
 -		cmdr |= MCI_CMDR_OPDCMD;
 -	if (!data) {
 -		// The no data case is fairly simple
 +
 +	/* If there is no data transfer, just set up the right interrupt mask
 +	 * and start the command.
 +	 *
 +	 * The interrupt mask needs to be CMDRDY plus all non-data-transfer
 +	 * errors. It's important to leave the transfer-related errors out, to
 +	 * avoid spurious timeout or crc errors on a STOP command following a
 +	 * multiblock read.  When a multiblock read is in progress, sending a
 +	 * STOP in the middle of a block occasionally triggers such errors, but
 +	 * we're totally disinterested in them because we've already gotten all
 +	 * the data we wanted without error before sending the STOP command.
 +	 */
 +
 +	if (data == NULL) {
 +		uint32_t ier = MCI_SR_CMDRDY |
 +		    MCI_SR_RTOE | MCI_SR_RENDE |
 +		    MCI_SR_RCRCE | MCI_SR_RDIRE | MCI_SR_RINDE;
 +
  		at91_mci_pdc_disable(sc);
 -//		printf("CMDR %x ARGR %x\n", cmdr, cmd->arg);
 +
 +		if (cmd->opcode == MMC_STOP_TRANSMISSION)
 +			cmdr |= MCI_CMDR_TRCMD_STOP;
 +
 +		/* Ignore response CRC on CMD1 and ACMD41, per standard. */
 +
 +		if (cmd->opcode == MMC_SEND_OP_COND ||
 +		    cmd->opcode == ACMD_SD_SEND_OP_COND)
 +			ier &= ~MCI_SR_RCRCE;
 +
 +		if (mci_debug)
 +			printf("CMDR %x (opcode %d) ARGR %x no data\n", 
 +			    cmdr, cmd->opcode, cmd->arg);
 +
  		WR4(sc, MCI_ARGR, cmd->arg);
  		WR4(sc, MCI_CMDR, cmdr);
 -		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_CMDRDY);
 +		WR4(sc, MCI_IDR, 0xffffffff);
 +		WR4(sc, MCI_IER, ier);
  		return;
  	}
 +
 +	/* There is data, set up the transfer-related parts of the command. */
 +
  	if (data->flags & MMC_DATA_READ)
  		cmdr |= MCI_CMDR_TRDIR;
 +
  	if (data->flags & (MMC_DATA_READ | MMC_DATA_WRITE))
  		cmdr |= MCI_CMDR_TRCMD_START;
 +
  	if (data->flags & MMC_DATA_STREAM)
  		cmdr |= MCI_CMDR_TRTYP_STREAM;
 -	if (data->flags & MMC_DATA_MULTI)
 +	else if (data->flags & MMC_DATA_MULTI) {
  		cmdr |= MCI_CMDR_TRTYP_MULTIPLE;
 -	// Set block size and turn on PDC mode for dma xfer and disable
 -	// PDC until we're ready.
 -	mr = RD4(sc, MCI_MR) & ~MCI_MR_BLKLEN;
 -	WR4(sc, MCI_MR, mr | (data->len << 16) | MCI_MR_PDCMODE);
 -	WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS);
 -	if (cmdr & MCI_CMDR_TRCMD_START) {
 -		if (cmdr & MCI_CMDR_TRDIR)
 -			vaddr = cmd->data->data;
 -		else {
 -			/* Use bounce buffer even if we don't need
 -			 * byteswap, since buffer may straddle a page
 -			 * boundry, and we don't handle multi-segment
 -			 * transfers in hardware.
 -			 * (page issues seen from 'bsdlabel -w' which
 -			 * uses raw geom access to the volume).
 -			 * Greg Ansley (gja (at) ansley.com)
 -			 */
 -			vaddr = sc->bounce_buffer;
 -			src = (uint32_t *)cmd->data->data;
 -			dst = (uint32_t *)vaddr;
 -			if (sc->sc_cap & CAP_NEEDS_BYTESWAP) {
 -				for (i = 0; i < data->len / 4; i++)
 -					dst[i] = bswap32(src[i]);
 -			} else
 -				memcpy(dst, src, data->len);
 -		}
 -		data->xfer_len = 0;
 -		if (bus_dmamap_load(sc->dmatag, sc->map, vaddr, data->len,
 -		    at91_mci_getaddr, &paddr, 0) != 0) {
 -			cmd->error = MMC_ERR_NO_MEMORY;
 -			sc->req = NULL;
 -			sc->curcmd = NULL;
 -			cmd->mrq->done(cmd->mrq);
 -			return;
 -		}
 -		sc->mapped++;
 -		if (cmdr & MCI_CMDR_TRDIR) {
 -			bus_dmamap_sync(sc->dmatag, sc->map, BUS_DMASYNC_PREREAD);
 -			WR4(sc, PDC_RPR, paddr);
 -			WR4(sc, PDC_RCR, data->len / 4);
 -			ier = MCI_SR_ENDRX;
 -		} else {
 -			bus_dmamap_sync(sc->dmatag, sc->map, BUS_DMASYNC_PREWRITE);
 -			WR4(sc, PDC_TPR, paddr);
 -			WR4(sc, PDC_TCR, data->len / 4);
 -			ier = MCI_SR_TXBUFE;
 -		}
 +		sc->flags |= (data->flags & MMC_DATA_READ) ? 
 +				CMD_MULTIREAD : CMD_MULTIWRITE;
  	}
 -//	printf("CMDR %x ARGR %x with data\n", cmdr, cmd->arg);
 -	WR4(sc, MCI_ARGR, cmd->arg);
 -	if (cmdr & MCI_CMDR_TRCMD_START) {
 -		if (cmdr & MCI_CMDR_TRDIR) {
 +
 +	/* Disable PDC until we're ready.
 +	 *
 +	 * Set block size and turn on PDC mode for dma xfer.
 +	 * Note that the block size is the smaller of the amount of data to be
 +	 * transferred, or 512 bytes.  The 512 size is fixed by the standard;
 +	 * smaller blocks are possible, but never larger.
 +	 */
 +
 +	WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS); 
 +
 +	mr = RD4(sc,MCI_MR) & ~MCI_MR_BLKLEN; 
 +	mr |=  min(data->len, 512) << 16; 
 +	WR4(sc, MCI_MR, mr | MCI_MR_PDCMODE|MCI_MR_PDCPADV);
 +
 +	/* Set up DMA.
 +	 *
 +	 * Use bounce buffers even if we don't need to byteswap, because doing
 +	 * multi-block IO with large DMA buffers is way fast (compared to
 +	 * single-block IO), even after incurring the overhead of also copying
 +	 * from/to the caller's buffers (which may be in non-contiguous physical
 +	 * pages).
 +	 *
 +	 * In an ideal non-byteswap world we could create a dma tag that allows
 +	 * for discontiguous segments and do the IO directly from/to the
 +	 * caller's buffer(s), using ENDRX/ENDTX interrupts to chain the
 +	 * discontiguous buffers through the PDC. Someday.
 +	 *
 +         * If a read is bigger than 2k, split it in half so that we can start
 +	 * byte-swapping the first half while the second half is on the wire.
 +	 * It would be best if we could split it into 8k chunks, but we can't
 +	 * always keep up with the byte-swapping due to other system activity,
 +	 * and if an RXBUFF interrupt happens while we're still handling the
 +	 * byte-swap from the prior buffer (IE, we haven't returned from
 +	 * handling the prior interrupt yet), then data will get dropped on the
 +	 * floor and we can't easily recover from that.  The right fix for that
 +	 * would be to have the interrupt handling only keep the DMA flowing and
 +	 * enqueue filled buffers to be byte-swapped in a non-interrupt context.
 +	 * Even that won't work on the write side of things though; in that
 +	 * context we have to have all the data ready to go before starting the
 +	 * dma.
 +	 *
 +	 * XXX what about stream transfers?
 +	 */
 +
 +	sc->xfer_offset = 0;
 +	sc->bbuf_curidx = 0;
 +
 +	if (data->flags & (MMC_DATA_READ | MMC_DATA_WRITE)) {
 +		uint32_t len;
 +		uint32_t remaining = data->len;
 +		bus_addr_t paddr;
 +		int err;
 +
 +		if (remaining > (BBCOUNT*BBSIZE))
 +			panic("IO read size exceeds MAXDATA\n");
 +
 +		if (data->flags & MMC_DATA_READ) {
 +			if (remaining > 2048)
 +				len = remaining / 2;
 +			else
 +				len = remaining;
 +			err = bus_dmamap_load(sc->dmatag, sc->bbuf_map[0], 
 +			    sc->bbuf_vaddr[0], len, at91_mci_getaddr, 
 +			    &paddr, BUS_DMA_NOWAIT);
 +			if (err != 0)
 +				panic("IO read dmamap_load failed\n");
 +			bus_dmamap_sync(sc->dmatag, sc->bbuf_map[0], 
 +			    BUS_DMASYNC_PREREAD);
 +			WR4(sc, PDC_RPR, paddr);
 +			WR4(sc, PDC_RCR, len / 4);
 +			sc->bbuf_len[0] = len;
 +			remaining -= len;
 +			if (remaining == 0) {
 +				sc->bbuf_len[1] = 0;
 +			} else {
 +				len = remaining;
 +				err = bus_dmamap_load(sc->dmatag, sc->bbuf_map[1], 
 +				    sc->bbuf_vaddr[1], len, at91_mci_getaddr, 
 +				    &paddr, BUS_DMA_NOWAIT);
 +				if (err != 0)
 +					panic("IO read dmamap_load failed\n");
 +				bus_dmamap_sync(sc->dmatag, sc->bbuf_map[1], 
 +				    BUS_DMASYNC_PREREAD);
 +				WR4(sc, PDC_RNPR, paddr);
 +				WR4(sc, PDC_RNCR, len / 4);
 +				sc->bbuf_len[1] = len;
 +				remaining -= len;
 +			}
  			WR4(sc, PDC_PTCR, PDC_PTCR_RXTEN);
 -			WR4(sc, MCI_CMDR, cmdr);
  		} else {
 -			WR4(sc, MCI_CMDR, cmdr);
 -			WR4(sc, PDC_PTCR, PDC_PTCR_TXTEN);
 +			len = min(BBSIZE, remaining);
 +			at91_bswap_buf(sc, sc->bbuf_vaddr[0], data->data, len);
 +			err = bus_dmamap_load(sc->dmatag, sc->bbuf_map[0], 
 +			    sc->bbuf_vaddr[0], len, at91_mci_getaddr, 
 +			    &paddr, BUS_DMA_NOWAIT);
 +			if (err != 0)
 +				panic("IO write dmamap_load failed\n");
 +			bus_dmamap_sync(sc->dmatag, sc->bbuf_map[0], 
 +			    BUS_DMASYNC_PREWRITE);
 +			WR4(sc, PDC_TPR,paddr);
 +			WR4(sc, PDC_TCR, len / 4);
 +			sc->bbuf_len[0] = len;
 +			remaining -= len;
 +			if (remaining == 0) {
 +				sc->bbuf_len[1] = 0;
 +			} else {
 +				len = remaining;
 +				at91_bswap_buf(sc, sc->bbuf_vaddr[1],
 +				    ((char *)data->data)+BBSIZE, len);
 +				err = bus_dmamap_load(sc->dmatag, sc->bbuf_map[1], 
 +				    sc->bbuf_vaddr[1], len, at91_mci_getaddr, 
 +				    &paddr, BUS_DMA_NOWAIT);
 +				if (err != 0)
 +					panic("IO write dmamap_load failed\n");
 +				bus_dmamap_sync(sc->dmatag, sc->bbuf_map[1], 
 +				    BUS_DMASYNC_PREWRITE);
 +				WR4(sc, PDC_TNPR, paddr);
 +				WR4(sc, PDC_TNCR, len / 4);
 +				sc->bbuf_len[1] = len;
 +				remaining -= len;
 +			}
 +			/* do not enable PDC xfer until CMDRDY asserted */
  		}
 +		data->xfer_len = 0; /* XXX what's this? appears to be unused. */
  	}
 -	WR4(sc, MCI_IER, MCI_SR_ERROR | ier);
 +
 +	if (mci_debug)
 +		printf("CMDR %x (opcode %d) ARGR %x with data len %d\n", 
 +		       cmdr, cmd->opcode, cmd->arg, cmd->data->len);
 +
 +	WR4(sc, MCI_ARGR, cmd->arg);
 +	WR4(sc, MCI_CMDR, cmdr);
 +	WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_CMDRDY);
  }
  
  static void
 -at91_mci_start(struct at91_mci_softc *sc)
 +at91_mci_next_operation(struct at91_mci_softc *sc)
  {
  	struct mmc_request *req;
  
  	req = sc->req;
  	if (req == NULL)
  		return;
 -	// assert locked
 -	if (!(sc->flags & CMD_STARTED)) {
 -		sc->flags |= CMD_STARTED;
 -//		printf("Starting CMD\n");
 +
 +	if (sc->flags & PENDING_CMD) {
 +		sc->flags &= ~PENDING_CMD;
  		at91_mci_start_cmd(sc, req->cmd);
  		return;
 -	}
 -	if (!(sc->flags & STOP_STARTED) && req->stop) {
 -//		printf("Starting Stop\n");
 -		sc->flags |= STOP_STARTED;
 +	} else if (sc->flags & PENDING_STOP) {
 +		sc->flags &= ~PENDING_STOP;
  		at91_mci_start_cmd(sc, req->stop);
  		return;
  	}
 -	/* We must be done -- bad idea to do this while locked? */
 +
 +	WR4(sc, MCI_IDR, 0xffffffff);
  	sc->req = NULL;
  	sc->curcmd = NULL;
 +	//printf("req done\n");
  	req->done(req);
  }
  
 @@ -486,16 +809,16 @@
  	struct at91_mci_softc *sc = device_get_softc(brdev);
  
  	AT91_MCI_LOCK(sc);
 -	// XXX do we want to be able to queue up multiple commands?
 -	// XXX sounds like a good idea, but all protocols are sync, so
 -	// XXX maybe the idea is naive...
  	if (sc->req != NULL) {
  		AT91_MCI_UNLOCK(sc);
  		return (EBUSY);
  	}
 +	//printf("new req\n");
  	sc->req = req;
 -	sc->flags = 0;
 -	at91_mci_start(sc);
 +	sc->flags = PENDING_CMD;
 +	if (sc->req->stop)
 +		sc->flags |= PENDING_STOP;
 +	at91_mci_next_operation(sc);
  	AT91_MCI_UNLOCK(sc);
  	return (0);
  }
 @@ -533,120 +856,341 @@
  }
  
  static void
 -at91_mci_read_done(struct at91_mci_softc *sc)
 +at91_mci_read_done(struct at91_mci_softc *sc, uint32_t sr)
  {
 -	uint32_t *walker;
 -	struct mmc_command *cmd;
 -	int i, len;
 -
 -	cmd = sc->curcmd;
 -	bus_dmamap_sync(sc->dmatag, sc->map, BUS_DMASYNC_POSTREAD);
 -	bus_dmamap_unload(sc->dmatag, sc->map);
 -	sc->mapped--;
 -	if (sc->sc_cap & CAP_NEEDS_BYTESWAP) {
 -		walker = (uint32_t *)cmd->data->data;
 -		len = cmd->data->len / 4;
 -		for (i = 0; i < len; i++)
 -			walker[i] = bswap32(walker[i]);
 -	}
 -	// Finish up the sequence...
 -	WR4(sc, MCI_IDR, MCI_SR_ENDRX);
 -	WR4(sc, MCI_IER, MCI_SR_RXBUFF);
 -	WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS);
 +	struct mmc_command *cmd = sc->curcmd;
 +	char * dataptr = (char *)cmd->data->data;
 +	uint32_t curidx = sc->bbuf_curidx;
 +	uint32_t len = sc->bbuf_len[curidx];
 +
 +	/* We arrive here when a DMA transfer for a read is done, whether it's a
 +	 * single or multi-block read.
 +	 *
 +	 * We byte-swap the buffer that just completed, and if that is the last
 +	 * buffer that's part of this read then we move on to the next
 +	 * operation, otherwise we wait for another ENDRX for the next buffer.
 +	 */
 +
 +	bus_dmamap_sync(sc->dmatag, sc->bbuf_map[curidx], BUS_DMASYNC_POSTREAD);
 +	bus_dmamap_unload(sc->dmatag, sc->bbuf_map[curidx]);
 +
 +	at91_bswap_buf(sc, dataptr + sc->xfer_offset, sc->bbuf_vaddr[curidx], len);
 +
 +	if (mci_debug) {
 +		printf("read done sr %x curidx %d len %d xfer_offset %d\n",
 +		       sr, curidx, len, sc->xfer_offset);
 +	}
 +
 +	sc->xfer_offset += len;
 +	sc->bbuf_curidx = !curidx; /* swap buffers */
 +
 +	/* If we've transferred all the data, move on to the next operation.
 +	 *
 +	 * If we're still transferring the last buffer, RNCR is already zero but
 +	 * we have to write a zero anyway to clear the ENDRX status so we don't
 +	 * re-interrupt until the last buffer is done.
 +	 */
 +
 +	if (sc->xfer_offset == cmd->data->len) {
 +		WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS);
 +		cmd->error = MMC_ERR_NONE;
 +		at91_mci_next_operation(sc);
 +	} else {
 +		WR4(sc, PDC_RNCR, 0);
 +		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_ENDRX);
 +	}
  }
  
  static void
 -at91_mci_xmit_done(struct at91_mci_softc *sc)
 +at91_mci_write_done(struct at91_mci_softc *sc, uint32_t sr)
  {
 -	// Finish up the sequence...
 +	struct mmc_command *cmd = sc->curcmd;
 +
 +	/* We arrive here when the entire DMA transfer for a write is done,
 +	 * whether it's a single or multi-block write.  If it's multi-block we
 +	 * have to immediately move on to the next operation which is to send
 +	 * the stop command.  If it's a single-block transfer we need to wait
 +	 * for NOTBUSY, but if that's already asserted we can avoid another
 +	 * interrupt and just move on to completing the request right away.
 +	 */
 +
  	WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS);
 -	WR4(sc, MCI_IDR, MCI_SR_TXBUFE);
 -	WR4(sc, MCI_IER, MCI_SR_NOTBUSY);
 -	bus_dmamap_sync(sc->dmatag, sc->map, BUS_DMASYNC_POSTWRITE);
 -	bus_dmamap_unload(sc->dmatag, sc->map);
 -	sc->mapped--;
 +
 +	bus_dmamap_sync(sc->dmatag, sc->bbuf_map[sc->bbuf_curidx], BUS_DMASYNC_POSTWRITE);
 +	bus_dmamap_unload(sc->dmatag, sc->bbuf_map[sc->bbuf_curidx]);
 +
 +	if ((cmd->data->flags & MMC_DATA_MULTI) || (sr & MCI_SR_NOTBUSY)) {
 +                cmd->error = MMC_ERR_NONE;
 +                at91_mci_next_operation(sc);
 +	} else {
 +		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_NOTBUSY);
 +	}
 +}
 +
 +static void
 +at91_mci_notbusy(struct at91_mci_softc *sc)
 +{
 +	struct mmc_command *cmd = sc->curcmd;
 +
 +	/* We arrive here by either completion of a single-block write, or
 +	 * completion of the stop command that ended a multi-block write (and, I
 +	 * suppose, after a card-select or erase, but I haven't tested those).
 +	 * Anyway, we're done and it's time to move on to the next command.
 +	 */
 +
 +	cmd->error = MMC_ERR_NONE;
 +	at91_mci_next_operation(sc);
 +}
 +
 +static void
 +at91_mci_stop_done(struct at91_mci_softc *sc, uint32_t sr)
 +{
 +	struct mmc_command *cmd = sc->curcmd;
 +
 +	/* We arrive here after receiving CMDRDY for a MMC_STOP_TRANSMISSION
 +	 * command.  Depending on the operation being stopped, we may have to do
 +	 * some unusual things to work around hardware bugs.
 +	 */
 +
 +	/* This is known to be true of at91rm9200 hardware; it may or may not
 +	 * apply to more recent chips: 
 +	 *
 +	 * After stopping a multi-block write, the NOTBUSY bit in MCI_SR does
 +	 * not properly reflect the actual busy state of the card as signaled on
 +	 * the DAT0 line; it always claims the card is not-busy.  If we believe
 +	 * that and let operations continue, following commands will fail with
 +	 * response timeouts (except of course MMC_SEND_STATUS -- it indicates
 +	 * the card is busy in the PRG state, which was the smoking gun that
 +	 * showed MCI_SR NOTBUSY was not tracking DAT0 correctly).
 +	 *
 +	 * The atmel docs are emphatic: "This flag [NOTBUSY] must be used only
 +	 * for Write Operations."  I guess technically since we sent a stop it's
 +	 * not a write operation anymore.  But then just what did they think it
 +	 * meant for the stop command to have "...an optional busy signal
 +	 * transmitted on the data line" according to the SD spec?
 +	 *
 +	 * I tried a variety of things to un-wedge the MCI and get the status
 +	 * register to reflect NOTBUSY correctly again, but the only thing that
 +	 * worked was a full device reset.  It feels like an awfully big hammer,
 +	 * but doing a full reset after every multiblock write is still faster
 +	 * than doing single-block IO (by almost two orders of magnitude:
 +	 * 20KB/sec improves to about 1.8MB/sec best case).
 +	 *
 +	 * After doing the reset, wait for a NOTBUSY interrupt before continuing
 +	 * with the next operation.
 +	 */
 +
 +	if (sc->flags & CMD_MULTIWRITE) {
 +		at91_mci_reset(sc);
 +		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_NOTBUSY);
 +		return;
 +	}
 +
 +	/* This is known to be true of at91rm9200 hardware; it may or may not
 +	 * apply to more recent chips: 
 +	 *
 +	 * After stopping a multi-block read, loop to read and discard any data
 +	 * that coasts in after we sent the stop command.  The docs don't say
 +	 * anything about it, but empirical testing shows that 1-3 additional
 +	 * words of data get buffered up in some unmentioned internal fifo and
 +	 * if we don't read and discard them here they end up on the front of
 +	 * the next read DMA transfer we do.
 +	 */
 +
 +	if (sc->flags & CMD_MULTIREAD) {
 +		uint32_t sr;
 +		int count = 0;
 +		do {
 +			sr = RD4(sc, MCI_SR);
 +			if (sr & MCI_SR_RXRDY) {
 +				RD4(sc,  MCI_RDR);
 +				++count;
 +			}
 +		} while (sr & MCI_SR_RXRDY);
 +		at91_mci_reset(sc);
 +//              if (count != 0)
 +//                      printf("Had to soak up %d words after read\n", count);
 +	}
 +
 +	cmd->error = MMC_ERR_NONE;
 +	at91_mci_next_operation(sc);
 +
 +}
 +
 +static void
 +at91_mci_cmdrdy(struct at91_mci_softc *sc, uint32_t sr)
 +{
 +	struct mmc_command *cmd = sc->curcmd;
 +	int i;
 +
 +	if (cmd == NULL)
 +		return;
 +
 +	/* We get here at the end of EVERY command.  We retrieve the command
 +	 * response (if any) then decide what to do next based on the command.
 +	 */
 +
 +	if (cmd->flags & MMC_RSP_PRESENT) {
 +		for (i = 0; i < ((cmd->flags & MMC_RSP_136) ? 4 : 1); i++) {
 +			cmd->resp[i] = RD4(sc, MCI_RSPR + i * 4);
 +			if (mci_debug)
 +				printf("RSPR[%d] = %x sr=%x\n", i, cmd->resp[i],  sr);
 +		}
 +	}
 +
 +	/* If this was a stop command, go handle the various special
 +	 * conditions (read: bugs) that have to be dealt with following a stop.
 +	 */
 +
 +	if (cmd->opcode == MMC_STOP_TRANSMISSION) {
 +		at91_mci_stop_done(sc, sr);
 +		return;
 +	}
 +
 +	/* If this command can continue to assert BUSY beyond the response then
 +	 * we need to wait for NOTBUSY before the command is really done.
 +	 *
 +	 * Note that this may not work properly on the at91rm9200.  It certainly
 +	 * doesn't work for the STOP command that follows a multi-block write,
 +	 * so post-stop CMDRDY is handled separately; see the special handling
 +	 * in at91_mci_stop_done().
 +	 *
 +	 * Beside STOP, there are other R1B-type commands that use the busy
 +	 * signal after CMDRDY: CMD7 (card select), CMD28-29 (write protect),
 +	 * CMD38 (erase). I haven't tested any of them, but I rather expect
 +	 * them all to have the same sort of problem with MCI_SR not actually
 +	 * reflecting the state of the DAT0-line busy indicator.  So this code
 +	 * may need to grow some sort of special handling for them too. (This
 +	 * just in: CMD7 isn't a problem right now because dev/mmc.c incorrectly
 +	 * sets the response flags to R1 rather than R1B.)
 +	 */
 +
 +	if ((cmd->flags & MMC_RSP_BUSY)) {
 +		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_NOTBUSY);
 +		return;
 +	}
 +
 +	/* If there is a data transfer with this command, then...
 +	 * - If it's a read, we need to wait for ENDRX.
 +	 * - If it's a write, now is the time to enable the PDC, and we need to
 +	 *   wait for a BLKE that follows a TXBUFE, because if we're doing a
 +	 *   split transfer we get a BLKE after the first half (when TPR/TCR get
 +	 *   loaded from TNPR/TNCR).  So first we wait for the TXBUFE, and the
 +	 *   handling for that interrupt will then invoke the wait for the
 +	 *   subsequent BLKE which indicates actual completion.
 +	 */
 +
 +	if (cmd->data) {
 +		uint32_t ier;
 +		if (cmd->data->flags & MMC_DATA_READ) {
 +			ier = MCI_SR_ENDRX;
 +		} else {
 +			ier = MCI_SR_TXBUFE;
 +			WR4(sc, PDC_PTCR, PDC_PTCR_TXTEN);
 +		}
 +		WR4(sc, MCI_IER, MCI_SR_ERROR | ier);
 +		return;
 +	}
 +
 +	/* If we made it to here, we don't need to wait for anything more for
 +	 * the current command, move on to the next command (will complete the
 +	 * request if there is no next command).
 +	 */
 +
 +	cmd->error = MMC_ERR_NONE;
 +	at91_mci_next_operation(sc);
  }
  
  static void
  at91_mci_intr(void *arg)
  {
  	struct at91_mci_softc *sc = (struct at91_mci_softc*)arg;
 -	uint32_t sr;
 -	int i, done = 0;
 -	struct mmc_command *cmd;
 +	struct mmc_command *cmd = sc->curcmd;
 +	uint32_t sr, isr;
  
  	AT91_MCI_LOCK(sc);
 -	sr = RD4(sc, MCI_SR) & RD4(sc, MCI_IMR);
 -//	printf("i 0x%x\n", sr);
 -	cmd = sc->curcmd;
 -	if (sr & MCI_SR_ERROR) {
 -		// Ignore CRC errors on CMD2 and ACMD47, per relevant standards
 -		if ((sr & MCI_SR_RCRCE) && (cmd->opcode == MMC_SEND_OP_COND ||
 -		    cmd->opcode == ACMD_SD_SEND_OP_COND))
 -			cmd->error = MMC_ERR_NONE;
 -		else if (sr & (MCI_SR_RTOE | MCI_SR_DTOE))
 +
 +	sr = RD4(sc, MCI_SR);
 +	isr = sr & RD4(sc, MCI_IMR);
 +
 +	if (mci_debug)
 +		printf("i 0x%x sr 0x%x\n", isr, sr);
 +
 +	/* All interrupts are one-shot; disable them now.
 +	 * The next operation will re-enable whatever interrupts it wants.
 +	 */
 +
 +	WR4(sc, MCI_IDR, isr);
 +
 +	if (isr & MCI_SR_ERROR) {
 +		if (isr & (MCI_SR_RTOE | MCI_SR_DTOE))
  			cmd->error = MMC_ERR_TIMEOUT;
 -		else if (sr & (MCI_SR_RCRCE | MCI_SR_DCRCE))
 +		else if (isr & (MCI_SR_RCRCE | MCI_SR_DCRCE))
  			cmd->error = MMC_ERR_BADCRC;
 -		else if (sr & (MCI_SR_OVRE | MCI_SR_UNRE))
 +		else if (isr & (MCI_SR_OVRE | MCI_SR_UNRE))
  			cmd->error = MMC_ERR_FIFO;
  		else
  			cmd->error = MMC_ERR_FAILED;
 -		done = 1;
 -		if (sc->mapped && cmd->error) {
 -			bus_dmamap_unload(sc->dmatag, sc->map);
 -			sc->mapped--;
 +		/* CMD8 is used to probe for SDHC cards, a standard SD card will
 +		 * get a response timeout; don't report it because it's a normal
 +		 * and expected condition.  One might argue that all error
 +		 * reporting should be left to higher levels, but when they
 +		 * report at all it's always EIO, which isn't very helpful.
 +		 */
 +		if (cmd->opcode != 8) {
 +			device_printf(sc->dev, 
 +			    "IO error; status MCI_SR = 0x%x cmd opcode = %d%s\n",  
 +			    sr, cmd->opcode,
 +			    (cmd->opcode != 12) ? "" : 
 +			    (sc->flags & CMD_MULTIREAD) ? " after read" : " after write");
 +			at91_mci_reset(sc);
  		}
 +		at91_mci_next_operation(sc);
  	} else {
 -		if (sr & MCI_SR_TXBUFE) {
 +		if (isr & MCI_SR_TXBUFE) {
  //			printf("TXBUFE\n");
 -			at91_mci_xmit_done(sc);
 +			/* We need to wait for a BLKE that follows TXBUFE
 +			 * (intermediate BLKEs might happen after ENDTXes if
 +			 * we're chaining multiple buffers).  If BLKE is also
 +			 * asserted at the time we get TXBUFE, we can avoid
 +			 * another interrupt and process it right away, below.
 +			 */
 +			if (sr & MCI_SR_BLKE)
 +				isr |= MCI_SR_BLKE;
 +			else
 +				WR4(sc, MCI_IER, MCI_SR_BLKE);
  		}
 -		if (sr & MCI_SR_RXBUFF) {
 +		if (isr & MCI_SR_RXBUFF) {
  //			printf("RXBUFF\n");
 -			WR4(sc, MCI_IDR, MCI_SR_RXBUFF);
 -			WR4(sc, MCI_IER, MCI_SR_CMDRDY);
  		}
 -		if (sr & MCI_SR_ENDTX) {
 +		if (isr & MCI_SR_ENDTX) {
  //			printf("ENDTX\n");
  		}
 -		if (sr & MCI_SR_ENDRX) {
 +		if (isr & MCI_SR_ENDRX) {
  //			printf("ENDRX\n");
 -			at91_mci_read_done(sc);
 +			at91_mci_read_done(sc, sr);
  		}
 -		if (sr & MCI_SR_NOTBUSY) {
 +		if (isr & MCI_SR_NOTBUSY) {
  //			printf("NOTBUSY\n");
 -			WR4(sc, MCI_IDR, MCI_SR_NOTBUSY);
 -			WR4(sc, MCI_IER, MCI_SR_CMDRDY);
 +			at91_mci_notbusy(sc);
  		}
 -		if (sr & MCI_SR_DTIP) {
 +		if (isr & MCI_SR_DTIP) {
  //			printf("Data transfer in progress\n");
  		}
 -		if (sr & MCI_SR_BLKE) {
 +		if (isr & MCI_SR_BLKE) {
  //			printf("Block transfer end\n");
 +			at91_mci_write_done(sc, sr);
  		}
 -		if (sr & MCI_SR_TXRDY) {
 +		if (isr & MCI_SR_TXRDY) {
  //			printf("Ready to transmit\n");
  		}
 -		if (sr & MCI_SR_RXRDY) {
 +		if (isr & MCI_SR_RXRDY) {
  //			printf("Ready to receive\n");
  		}
 -		if (sr & MCI_SR_CMDRDY) {
 +		if (isr & MCI_SR_CMDRDY) {
  //			printf("Command ready\n");
 -			done = 1;
 -			cmd->error = MMC_ERR_NONE;
 -		}
 -	}
 -	if (done) {
 -		WR4(sc, MCI_IDR, 0xffffffff);
 -		if (cmd != NULL && (cmd->flags & MMC_RSP_PRESENT)) {
 -			for (i = 0; i < ((cmd->flags & MMC_RSP_136) ? 4 : 1);
 -			     i++) {
 -				cmd->resp[i] = RD4(sc, MCI_RSPR + i * 4);
 -//				printf("RSPR[%d] = %x\n", i, cmd->resp[i]);
 -			}
 +			at91_mci_cmdrdy(sc, sr);
  		}
 -		at91_mci_start(sc);
  	}
  	AT91_MCI_UNLOCK(sc);
  }
 @@ -703,7 +1247,7 @@
  		*(int *)result = sc->host.caps;
  		break;
  	case MMCBR_IVAR_MAX_DATA:
 -		*(int *)result = 1;
 +		*(int *)result = MAX_BLOCKS;
  		break;
  	}
  	return (0);
 
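The PENDING_CMD / PENDING_STOP sequencing that the patch introduces in
at91_mci_next_operation() can be sketched in isolation.  This is a minimal
model, not the driver code: the flag values are taken from the patch, while
the enum and the function name next_operation() are hypothetical and exist
only for illustration.

```c
#include <stdint.h>

/* Flag values as defined in the patch. */
#define PENDING_CMD	0x01
#define PENDING_STOP	0x02

/* Hypothetical result type; the driver calls at91_mci_start_cmd() instead. */
enum next_op { OP_CMD, OP_STOP, OP_DONE };

/*
 * Decide what to run next: the main command first, then the stop command
 * if one was queued, and finally report that the request is complete.
 * Each pending flag is cleared as it is consumed.
 */
static enum next_op
next_operation(uint32_t *flags)
{
	if (*flags & PENDING_CMD) {
		*flags &= ~PENDING_CMD;
		return (OP_CMD);
	}
	if (*flags & PENDING_STOP) {
		*flags &= ~PENDING_STOP;
		return (OP_STOP);
	}
	return (OP_DONE);
}
```

Because every completion path (CMDRDY, ENDRX, BLKE, NOTBUSY, error) funnels
back into the same function, a request with a stop command takes exactly three
calls to finish, and one without takes two.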
 
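The way the patch divides a transfer between the two bounce buffers can be
sketched as a small planning function.  This is an illustration under stated
assumptions rather than the driver code: BBCOUNT and BBSIZE use the values
from the committed version of the change, struct split and plan_split() are
hypothetical names, and the sketch returns -1 where the patch calls panic().

```c
#include <stdint.h>

#define BBCOUNT	2
#define BBSIZE	(16 * 1024)

/* Hypothetical holder for the two DMA lengths (in bytes). */
struct split {
	uint32_t len0;	/* goes to PDC_RPR/RCR or PDC_TPR/TCR */
	uint32_t len1;	/* goes to PDC_RNPR/RNCR or PDC_TNPR/TNCR, 0 if unused */
};

/*
 * Reads larger than 2k are split in half so the first half can be
 * byte-swapped while the second half is still on the wire.  Writes fill
 * the first buffer completely and put any remainder in the second.
 */
static int
plan_split(uint32_t total, int is_read, struct split *out)
{
	uint32_t first;

	if (total > BBCOUNT * BBSIZE)
		return (-1);	/* the patch panics here */
	if (is_read)
		first = (total > 2048) ? total / 2 : total;
	else
		first = (total < BBSIZE) ? total : BBSIZE;
	out->len0 = first;
	out->len1 = total - first;
	return (0);
}
```

With these rules neither buffer ever receives more than BBSIZE bytes, since
the total is capped at BBCOUNT * BBSIZE before the split is computed.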

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: arm/155214: commit references a PR
Date: Mon, 27 Aug 2012 04:04:04 +0000 (UTC)

 Author: imp
 Date: Mon Aug 27 04:03:49 2012
 New Revision: 239719
 URL: http://svn.freebsd.org/changeset/base/239719
 
 Log:
   Don't purposely overclock the SD bus to 30MHz, make the user
   explicitly enable that.  The driver chose to use 60MHz / 2 (30MHz)
   most of the time rather than 60MHz / 4 (15MHz) based on the Linux
   driver of the time.  This pushes the spec a little in order to not
   suffer the penalty of running at 15MHz.  However, when other bus
   masters are active in the system, and the user tries 4-wire mode, the
   internal bus arbitration would fail with data loss as a result.
   
   # Comments from PR were reworked to reflect my historical perspective
   
   PR:		155214 (partial)
   Submitted by:	Ian Lepore
 
 Modified:
   head/sys/arm/at91/at91_mci.c
 
 Modified: head/sys/arm/at91/at91_mci.c
 ==============================================================================
 --- head/sys/arm/at91/at91_mci.c	Mon Aug 27 03:09:39 2012	(r239718)
 +++ head/sys/arm/at91/at91_mci.c	Mon Aug 27 04:03:49 2012	(r239719)
 @@ -67,6 +67,53 @@ __FBSDID("$FreeBSD$");
  
  #include "opt_at91.h"
  
 +/*
 + * About running the MCI bus at 30mhz...
 + *
 + * Historically, the MCI bus has been run at 30mhz on systems with a 60mhz
 + * master clock, due to a bug in the mantissa table in dev/mmc.c making it
 + * appear that the card's max speed was always 30mhz.  Fixing that bug causes
 + * the mmc driver to request a 25mhz clock (as it should) and the logic in
 + * at91_mci_update_ios() picks the highest speed that doesn't exceed that limit.
 + * With a 60mhz MCK that would be 15mhz, and that's a real performance buzzkill
 + * when you've been getting away with 30mhz all along.
 + *
 + * By defining AT91_MCI_USE_30MHZ (or setting the 30mhz=1 device hint or
 + * sysctl) you can enable logic in at91_mci_update_ios() to overclock the SD
 + * bus a little by running it at MCK / 2 when MCK is greater than
 + * 50MHz and the requested speed is 25mhz.  This appears to work on virtually
 + * all SD cards, since it is what this driver has been doing prior to the
 + * introduction of this option, where the overclocking vs underclocking
 + * decision was automatically "overclock".  Modern SD cards can run at
 + * 45mhz/1-bit in standard mode (high speed mode enable commands not sent)
 + * without problems.
 + *
 + * Speaking of high-speed mode, the rm9200 manual says the MCI device supports
 + * the SD v1.0 specification and can run up to 50mhz.  This is interesting in
 + * that the SD v1.0 spec caps the speed at 25mhz; high speed mode was added in
 + * the v1.10 spec.  Furthermore, high speed mode doesn't just crank up the
 + * clock, it alters the signal timing.  The rm9200 MCI device doesn't support
 + * these altered timings.  So while speeds over 25mhz may work, they only work
 + * in what the SD spec calls "default" speed mode, and it amounts to violating
 + * the spec by overclocking the bus.
 + *
 + * If you also enable 4-wire mode it's possible the 30mhz transfers will fail.
 + * On the AT91RM9200, due to bugs in the bus contention logic, transfers fail
 + * if the USB host device and OHCI driver are enabled.  Even underclocking to
 + * 15MHz, intermittent overrun and underrun errors occur.  Note that you don't
 + * even need to have usb devices attached to the system, the errors begin to
 + * occur as soon as the OHCI driver sets the register bit to enable periodic
 + * transfers.  It appears (based on brief investigation) that the usb host
 + * controller uses so much ASB bandwidth that sometimes the DMA for MCI
 + * transfers doesn't get a bus grant in time and data gets dropped.  Adding
 + * even a modicum of network activity changes the symptom from intermittent to
 + * very frequent.  Members of the AT91SAM9 family have corrected this problem, or
 + * are at least better about their use of the bus.
 + */
 +#ifndef AT91_MCI_USE_30MHZ
 +#define AT91_MCI_USE_30MHZ 1
 +#endif
 +
  #define BBSZ	512
  
  struct at91_mci_softc {
 @@ -76,9 +123,10 @@ struct at91_mci_softc {
  #define	CAP_HAS_4WIRE		1	/* Has 4 wire bus */
  #define	CAP_NEEDS_BYTESWAP	2	/* broken hardware needing bounce */
  	int flags;
 -	int has_4wire;
  #define CMD_STARTED	1
  #define STOP_STARTED	2
 +	int has_4wire;
 +	int use_30mhz;
  	struct resource *irq_res;	/* IRQ resource */
  	struct resource	*mem_res;	/* Memory resource */
  	struct mtx sc_mtx;
 @@ -236,16 +284,33 @@ at91_mci_attach(device_t dev)
  	if (sc->has_4wire)
  		sc->sc_cap |= CAP_HAS_4WIRE;
  
 -	sc->host.f_min = at91_master_clock / 512;
 +#if defined(AT91_MCI_USE_30MHZ) && AT91_MCI_USE_30MHZ != 0
 +	sc->use_30mhz = 1;
 +#endif
 +	resource_int_value(device_get_name(dev), device_get_unit(dev), 
 +			   "30mhz", &sc->use_30mhz);
 +	SYSCTL_ADD_UINT(sctx, SYSCTL_CHILDREN(soid), OID_AUTO, "30mhz",
 +	    CTLFLAG_RW, &sc->use_30mhz, 0, "use 30mhz clock for 25mhz request");
 +
 +	/* Our real min freq is master_clock/512, but upper driver layers are
 +	 * going to set the min speed during card discovery, and the right speed
 +	 * for that is 400khz, so advertise a safe value just under that.
 +	 *
 +	 * For max speed, while the rm9200 manual says the max is 50mhz, it also
 +	 * says it supports only the SD v1.0 spec, which means the real limit is
 +	 * 25mhz. On the other hand, historical use has been to slightly violate
 +	 * the standard by running the bus at 30mhz.  For more information on
 +	 * that, see the comments at the top of this file.
 +	 */
  	sc->host.f_min = 375000;
  	sc->host.f_max = at91_master_clock / 2;
 -	if (sc->host.f_max > 50000000)	
 -		sc->host.f_max = 50000000;	/* Limit to 50MHz */
 -
 +	if (sc->host.f_max > 25000000)	
 +		sc->host.f_max = 25000000;	/* Limit to 25MHz */
  	sc->host.host_ocr = MMC_OCR_320_330 | MMC_OCR_330_340;
  	sc->host.caps = 0;
  	if (sc->sc_cap & CAP_HAS_4WIRE)
  		sc->host.caps |= MMC_CAP_4_BIT_DATA;
 +
  	child = device_add_child(dev, "mmc", 0);
  	device_set_ivars(dev, &sc->host);
  	err = bus_generic_attach(dev);
 @@ -338,23 +403,38 @@ static int
  at91_mci_update_ios(device_t brdev, device_t reqdev)
  {
  	struct at91_mci_softc *sc;
 -	struct mmc_host *host;
  	struct mmc_ios *ios;
  	uint32_t clkdiv;
  
  	sc = device_get_softc(brdev);
 -	host = &sc->host;
 -	ios = &host->ios;
 -	// bus mode?
 +	ios = &sc->host.ios;
 +
 +	/*
 +	 * Calculate our closest available clock speed that doesn't exceed the
 +	 * requested speed.
 +	 *
 +	 * If the master clock is greater than 50MHz and the requested bus
 +	 * speed is 25mhz and the use_30mhz flag is on, set clkdiv to zero to
 +	 * get a master_clock / 2 (25-30MHz) MMC/SD clock rather than settle for
 +	 * the next lower clock (12-15MHz). See comments near the top of the
 +	 * file for more info.
 +	 *
 +	 * Whatever we come up with, store it back into ios->clock so that the
 +	 * upper layer drivers can report the actual speed of the bus.
 +	 */
  	if (ios->clock == 0) {
  		WR4(sc, MCI_CR, MCI_CR_MCIDIS);
  		clkdiv = 0;
  	} else {
 -		WR4(sc, MCI_CR, MCI_CR_MCIEN);
 -		if ((at91_master_clock % (ios->clock * 2)) == 0)
 +		WR4(sc, MCI_CR, MCI_CR_MCIEN|MCI_CR_PWSEN);
 +		if (sc->use_30mhz && ios->clock == 25000000 &&
 +		    at91_master_clock > 50000000)
 +			clkdiv = 0;
 +                else if ((at91_master_clock % (ios->clock * 2)) == 0)
  			clkdiv = ((at91_master_clock / ios->clock) / 2) - 1;
  		else
  			clkdiv = (at91_master_clock / ios->clock) / 2;
 +		ios->clock = at91_master_clock / ((clkdiv+1) * 2);
  	}
  	if (ios->bus_width == bus_width_4)
  		WR4(sc, MCI_SDCR, RD4(sc, MCI_SDCR) | MCI_SDCR_SDCBUS);
 
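The divider selection in at91_mci_update_ios() from the commit above can be
checked with a standalone sketch.  The arithmetic mirrors the diff, but the
function names pick_clkdiv() and actual_clock() are hypothetical, and the
sketch assumes a nonzero requested clock (the driver handles clock == 0
separately by disabling the MCI).

```c
#include <stdint.h>

/*
 * Pick the MCI clock divider.  The bus clock is
 * master_clock / ((clkdiv + 1) * 2), so clkdiv 0 means MCK / 2.
 * 'requested' must be nonzero.
 */
static uint32_t
pick_clkdiv(uint32_t master_clock, uint32_t requested, int use_30mhz)
{
	/* Optional overclock: MCK / 2 for a 25MHz request when MCK > 50MHz. */
	if (use_30mhz && requested == 25000000 && master_clock > 50000000)
		return (0);
	/* Exact divisor available: hit the requested speed precisely. */
	if ((master_clock % (requested * 2)) == 0)
		return (((master_clock / requested) / 2) - 1);
	/* Otherwise round down to the next lower available speed. */
	return ((master_clock / requested) / 2);
}

/* The speed that is stored back into ios->clock for upper layers. */
static uint32_t
actual_clock(uint32_t master_clock, uint32_t clkdiv)
{
	return (master_clock / ((clkdiv + 1) * 2));
}
```

For a 60MHz master clock and a 25MHz request this gives 30MHz with the
30mhz option on and 15MHz with it off, which is the trade-off the commit
log describes.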

From: dfilter@FreeBSD.ORG (dfilter service)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: arm/155214: commit references a PR
Date: Tue, 28 Aug 2012 01:29:10 +0000 (UTC)

 Author: imp
 Date: Tue Aug 28 01:28:52 2012
 New Revision: 239762
 URL: http://svn.freebsd.org/changeset/base/239762
 
 Log:
   Bring in the multi-block patches for mci.  These required extensive
   restructuring of the driver.  I've tried to preserve the other silicon
   workarounds that we've added over the years, but haven't had a chance
   to extensively test on other hardware.  On my AT91RM9200 with 30MHz/1
   wire/64 block transfers, I've been able to go from ~.66MB/s to
   2.25MB/s in the simple tests I performed, almost a 3.5x improvement.
   This cuts the boot time almost in half when everything else goes
   right (timed from rtc message to login: prompt).
   
   PR:		155214
   Submitted by:	Ian Lepore
 
 Modified:
   head/sys/arm/at91/at91_mci.c
 
 Modified: head/sys/arm/at91/at91_mci.c
 ==============================================================================
 --- head/sys/arm/at91/at91_mci.c	Mon Aug 27 23:27:41 2012	(r239761)
 +++ head/sys/arm/at91/at91_mci.c	Tue Aug 28 01:28:52 2012	(r239762)
 @@ -114,7 +114,24 @@ __FBSDID("$FreeBSD$");
  #define AT91_MCI_USE_30MHZ 1
  #endif
  
 -#define BBSZ	512
 +/*
 + * Allocate 2 bounce buffers we'll use to endian-swap the data due to the rm9200
 + * erratum.  We use a pair of buffers because when reading that lets us begin
 + * endian-swapping the data in the first buffer while the DMA is reading into
 + * the second buffer.  (We can't use the same trick for writing because we might
 + * not get all the data in the 2nd buffer swapped before the hardware needs it;
 + * dealing with that would add complexity to the driver.)
 + *
 + * The buffers are sized at 16K each due to the way the busdma cache sync
 + * operations work on arm.  A dcache_inv_range() operation on a range larger
 + * than 16K gets turned into a dcache_wbinv_all().  That needlessly flushes the
 + * entire data cache, impacting overall system performance.
 + */
 +#define BBCOUNT     2
 +#define BBSIZE      (16*1024)
 +#define MAX_BLOCKS  ((BBSIZE*BBCOUNT)/512)
 +
 +static int mci_debug;
  
  struct at91_mci_softc {
  	void *intrhand;			/* Interrupt handle */
 @@ -123,21 +140,25 @@ struct at91_mci_softc {
  #define	CAP_HAS_4WIRE		1	/* Has 4 wire bus */
  #define	CAP_NEEDS_BYTESWAP	2	/* broken hardware needing bounce */
  	int flags;
 -#define CMD_STARTED	1
 -#define STOP_STARTED	2
 +#define PENDING_CMD	0x01
 +#define PENDING_STOP	0x02
 +#define CMD_MULTIREAD	0x10
 +#define CMD_MULTIWRITE	0x20
  	int has_4wire;
  	int use_30mhz;
  	struct resource *irq_res;	/* IRQ resource */
  	struct resource	*mem_res;	/* Memory resource */
  	struct mtx sc_mtx;
  	bus_dma_tag_t dmatag;
 -	bus_dmamap_t map;
 -	int mapped;
  	struct mmc_host host;
  	int bus_busy;
  	struct mmc_request *req;
  	struct mmc_command *curcmd;
 -	char bounce_buffer[BBSZ];
 +	bus_dmamap_t bbuf_map[BBCOUNT];
 +	char      *  bbuf_vaddr[BBCOUNT]; /* bounce bufs in KVA space */
 +	uint32_t     bbuf_len[BBCOUNT];	  /* len currently queued for bounce buf */
 +	uint32_t     bbuf_curidx;	  /* which bbuf is the active DMA buffer */
 +	uint32_t     xfer_offset;	  /* offset so far into caller's buf */
  };
  
  static inline uint32_t
 @@ -172,6 +193,51 @@ static int at91_mci_is_mci1rev2xx(void);
  #define AT91_MCI_ASSERT_LOCKED(_sc)	mtx_assert(&_sc->sc_mtx, MA_OWNED);
  #define AT91_MCI_ASSERT_UNLOCKED(_sc) mtx_assert(&_sc->sc_mtx, MA_NOTOWNED);
  
 +static void 
 +at91_bswap_buf(struct at91_mci_softc *sc, void * dptr, void * sptr, uint32_t memsize)
 +{
 +	uint32_t * dst = (uint32_t *)dptr;
 +	uint32_t * src = (uint32_t *)sptr;
 +	uint32_t   i;
 +
 +	/*
 +	 * If the hardware doesn't need byte-swapping, let bcopy() do the
 +	 * work.  Use bounce buffer even if we don't need byteswap, since
 +	 * buffer may straddle a page boundary, and we don't handle
 +	 * multi-segment transfers in hardware.  Seen from 'bsdlabel -w' which
 +	 * uses raw geom access to the volume.  Greg Ansley (gja (at)
 +	 * ansley.com)
 +	 */
 +	if (!(sc->sc_cap & CAP_NEEDS_BYTESWAP)) {
 +		bcopy(sptr, dptr, memsize);	/* bcopy() takes (src, dst, len) */
 +		return;
 +	}
 +
 +	/*
 +	 * Nice performance boost for slightly unrolling this loop.
 +	 * (But very little extra boost for further unrolling it.)
 +	 */
 +	for (i = 0; i < memsize; i += 16) {
 +		*dst++ = bswap32(*src++);
 +		*dst++ = bswap32(*src++);
 +		*dst++ = bswap32(*src++);
 +		*dst++ = bswap32(*src++);
 +	}
 +
 +	/* Mop up the last 1-3 words, if any. */
 +	for (i = 0; i < (memsize & 0x0F); i += 4) {
 +		*dst++ = bswap32(*src++);
 +	}
 +}
 +
 +static void
 +at91_mci_getaddr(void *arg, bus_dma_segment_t *segs, int nsegs, int error)
 +{
 +	if (error != 0)
 +		return;
 +	*(bus_addr_t *)arg = segs[0].ds_addr;
 +}
 +
  static void
  at91_mci_pdc_disable(struct at91_mci_softc *sc)
  {
 @@ -186,13 +252,57 @@ at91_mci_pdc_disable(struct at91_mci_sof
  	WR4(sc, PDC_TNCR, 0);
  }
  
 +/*
 + * Reset the controller, then restore most of the current state.
 + *
 + * This is called after detecting an error.  It's also called after stopping a
 + * multi-block write, to un-wedge the device so that it will handle the NOTBUSY
 + * signal correctly.  See comments in at91_mci_stop_done() for more details.
 + */
 +static void at91_mci_reset(struct at91_mci_softc *sc)
 +{
 +	uint32_t mr;
 +	uint32_t sdcr;
 +	uint32_t dtor;
 +	uint32_t imr;
 +
 +	at91_mci_pdc_disable(sc);
 +
 +	/* save current state */
 +
 +	imr  = RD4(sc, MCI_IMR);
 +	mr   = RD4(sc, MCI_MR) & 0x7fff;
 +	sdcr = RD4(sc, MCI_SDCR);
 +	dtor = RD4(sc, MCI_DTOR);
 +
 +	/* reset the controller */
 +
 +	WR4(sc, MCI_IDR, 0xffffffff);
 +	WR4(sc, MCI_CR, MCI_CR_MCIDIS | MCI_CR_SWRST);
 +
 +	/* restore state */
 +
 +	WR4(sc, MCI_CR, MCI_CR_MCIEN|MCI_CR_PWSEN);
 +	WR4(sc, MCI_MR, mr);
 +	WR4(sc, MCI_SDCR, sdcr);
 +	WR4(sc, MCI_DTOR, dtor);
 +	WR4(sc, MCI_IER, imr);
 +
 +	/*
 +	 * Make sure sdio interrupts will fire.  Not sure why reading
 +	 * SR ensures that, but this is in the linux driver.
 +	 */
 +
 +	RD4(sc, MCI_SR);
 +}
 +
  static void
  at91_mci_init(device_t dev)
  {
  	struct at91_mci_softc *sc = device_get_softc(dev);
  	uint32_t val;
  
 -	WR4(sc, MCI_CR, MCI_CR_MCIEN);		/* Enable controller */
 +	WR4(sc, MCI_CR, MCI_CR_MCIDIS | MCI_CR_SWRST); /* device into reset */
  	WR4(sc, MCI_IDR, 0xffffffff);		/* Turn off interrupts */
  	WR4(sc, MCI_DTOR, MCI_DTOR_DTOMUL_1M | 1);
  	val = MCI_MR_PDCMODE;
 @@ -203,10 +313,19 @@ at91_mci_init(device_t dev)
  #ifndef  AT91_MCI_SLOT_B
  	WR4(sc, MCI_SDCR, 0);			/* SLOT A, 1 bit bus */
  #else
 -	/* XXX Really should add second "unit" but nobody using using
 -	 * a two slot card that we know of. -- except they are... XXX */
 +	/*
 +	 * XXX Really should add second "unit" but nobody is using
 +	 * a two slot card that we know of. XXX
 +	 */
  	WR4(sc, MCI_SDCR, 1);			/* SLOT B, 1 bit bus */
  #endif
 +	/*
 +	 * Enable controller, including power-save.  The slower clock
 +	 * of the power-save mode is only in effect when there is no
 +	 * transfer in progress, so it can be left in this mode all
 +	 * the time.
 +	 */
 +	WR4(sc, MCI_CR, MCI_CR_MCIEN|MCI_CR_PWSEN);
  }
  
  static void
 @@ -216,7 +335,7 @@ at91_mci_fini(device_t dev)
  
  	WR4(sc, MCI_IDR, 0xffffffff);		/* Turn off interrupts */
  	at91_mci_pdc_disable(sc);
 -	WR4(sc, MCI_CR, MCI_CR_MCIDIS | MCI_CR_SWRST); /* Put the device into reset */
 +	WR4(sc, MCI_CR, MCI_CR_MCIDIS | MCI_CR_SWRST); /* device into reset */
  }
  
  static int
 @@ -234,7 +353,7 @@ at91_mci_attach(device_t dev)
  	struct sysctl_ctx_list *sctx;
  	struct sysctl_oid *soid;
  	device_t child;
 -	int err;
 +	int err, i;
  
  	sctx = device_get_sysctl_ctx(dev);
  	soid = device_get_sysctl_tree(dev);
 @@ -249,21 +368,33 @@ at91_mci_attach(device_t dev)
  
  	AT91_MCI_LOCK_INIT(sc);
  
 +	at91_mci_fini(dev);
 +	at91_mci_init(dev);
 +
  	/*
 -	 * Allocate DMA tags and maps
 +	 * Allocate DMA tags and maps and bounce buffers.
 +	 *
 +	 * The parms in the tag_create call cause the dmamem_alloc call to
 +	 * create each bounce buffer as a single contiguous buffer of BBSIZE
 +	 * bytes aligned to a 4096 byte boundary.
 +	 *
 +	 * Do not use DMA_COHERENT for these buffers because that maps the
 +	 * memory as non-cachable, which prevents cache line burst fills/writes,
 +	 * which is something we need since we're trying to overlap the
 +	 * byte-swapping with the DMA operations.
  	 */
 -	err = bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0,
 -	    BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR, NULL, NULL, MAXPHYS, 1,
 -	    MAXPHYS, BUS_DMA_ALLOCNOW, NULL, NULL, &sc->dmatag);
 +	err = bus_dma_tag_create(bus_get_dma_tag(dev), 4096, 0,
 +	    BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR, NULL, NULL, 
 +	    BBSIZE, 1, BBSIZE, 0, NULL, NULL, &sc->dmatag);
  	if (err != 0)
  		goto out;
  
 -	err = bus_dmamap_create(sc->dmatag, 0,  &sc->map);
 -	if (err != 0)
 -		goto out;
 -
 -	at91_mci_fini(dev);
 -	at91_mci_init(dev);
 +	for (i = 0; i < BBCOUNT; ++i) {
 +		err = bus_dmamem_alloc(sc->dmatag, (void **)&sc->bbuf_vaddr[i],
 +		    BUS_DMA_NOWAIT, &sc->bbuf_map[i]);
 +		if (err != 0)
 +			goto out;
 +	}
  
  	/*
  	 * Activate the interrupt
 @@ -330,8 +461,15 @@ out:
  static int
  at91_mci_detach(device_t dev)
  {
 +	struct at91_mci_softc *sc = device_get_softc(dev);
 +
  	at91_mci_fini(dev);
  	at91_mci_deactivate(dev);
 +
 +	bus_dmamem_free(sc->dmatag, sc->bbuf_vaddr[0], sc->bbuf_map[0]);
 +	bus_dmamem_free(sc->dmatag, sc->bbuf_vaddr[1], sc->bbuf_map[1]);
 +	bus_dma_tag_destroy(sc->dmatag);
 +
  	return (EBUSY);	/* XXX */
  }
  
 @@ -398,14 +536,6 @@ at91_mci_is_mci1rev2xx(void)
  	}
  }
  
 -static void
 -at91_mci_getaddr(void *arg, bus_dma_segment_t *segs, int nsegs, int error)
 -{
 -	if (error != 0)
 -		return;
 -	*(bus_addr_t *)arg = segs[0].ds_addr;
 -}
 -
  static int
  at91_mci_update_ios(device_t brdev, device_t reqdev)
  {
 @@ -437,7 +567,7 @@ at91_mci_update_ios(device_t brdev, devi
  		if (sc->use_30mhz && ios->clock == 25000000 &&
  		    at91_master_clock > 50000000)
  			clkdiv = 0;
 -                else if ((at91_master_clock % (ios->clock * 2)) == 0)
 +		else if ((at91_master_clock % (ios->clock * 2)) == 0)
  			clkdiv = ((at91_master_clock / ios->clock) / 2) - 1;
  		else
  			clkdiv = (at91_master_clock / ios->clock) / 2;
 @@ -456,73 +586,182 @@ at91_mci_update_ios(device_t brdev, devi
  static void
  at91_mci_start_cmd(struct at91_mci_softc *sc, struct mmc_command *cmd)
  {
 -	size_t len;
 -	uint32_t cmdr, ier = 0, mr;
 -	uint32_t *src, *dst;
 -	int i;
 +	uint32_t cmdr, mr;
  	struct mmc_data *data;
 -	void *vaddr;
 -	bus_addr_t paddr;
  
  	sc->curcmd = cmd;
  	data = cmd->data;
 -	cmdr = cmd->opcode;
  
  	/* XXX Upper layers don't always set this */
  	cmd->mrq = sc->req;
  
 +	/* Begin setting up command register. */
 +
 +	cmdr = cmd->opcode;
 +
 +	if (sc->host.ios.bus_mode == opendrain)
 +		cmdr |= MCI_CMDR_OPDCMD;
 +
 +	/* Set up response handling.  Allow max timeout for responses. */
 +
  	if (MMC_RSP(cmd->flags) == MMC_RSP_NONE)
  		cmdr |= MCI_CMDR_RSPTYP_NO;
  	else {
 -		/* Allow big timeout for responses */
  		cmdr |= MCI_CMDR_MAXLAT;
  		if (cmd->flags & MMC_RSP_136)
  			cmdr |= MCI_CMDR_RSPTYP_136;
  		else
  			cmdr |= MCI_CMDR_RSPTYP_48;
  	}
 -	if (cmd->opcode == MMC_STOP_TRANSMISSION)
 -		cmdr |= MCI_CMDR_TRCMD_STOP;
 -	if (sc->host.ios.bus_mode == opendrain)
 -		cmdr |= MCI_CMDR_OPDCMD;
 -	if (!data) {
 -		// The no data case is fairly simple
 +
 +	/*
 +	 * If there is no data transfer, just set up the right interrupt mask
 +	 * and start the command.
 +	 *
 +	 * The interrupt mask needs to be CMDRDY plus all non-data-transfer
 +	 * errors. It's important to leave the transfer-related errors out, to
 +	 * avoid spurious timeout or crc errors on a STOP command following a
 +	 * multiblock read.  When a multiblock read is in progress, sending a
 +	 * STOP in the middle of a block occasionally triggers such errors, but
 +	 * we're totally disinterested in them because we've already gotten all
 +	 * the data we wanted without error before sending the STOP command.
 +	 */
 +
 +	if (data == NULL) {
 +		uint32_t ier = MCI_SR_CMDRDY | 
 +		    MCI_SR_RTOE | MCI_SR_RENDE | 
 +		    MCI_SR_RCRCE | MCI_SR_RDIRE | MCI_SR_RINDE;
 +
  		at91_mci_pdc_disable(sc);
 -//		printf("CMDR %x ARGR %x\n", cmdr, cmd->arg);
 +
 +		if (cmd->opcode == MMC_STOP_TRANSMISSION)
 +			cmdr |= MCI_CMDR_TRCMD_STOP;
 +
 +		/* Ignore response CRC on CMD1 and ACMD41, per standard. */
 +
 +		if (cmd->opcode == MMC_SEND_OP_COND ||
 +		    cmd->opcode == ACMD_SD_SEND_OP_COND)
 +			ier &= ~MCI_SR_RCRCE;
 +
 +		if (mci_debug)
 +			printf("CMDR %x (opcode %d) ARGR %x no data\n", 
 +			    cmdr, cmd->opcode, cmd->arg);
 +
  		WR4(sc, MCI_ARGR, cmd->arg);
  		WR4(sc, MCI_CMDR, cmdr);
 -		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_CMDRDY);
 +		WR4(sc, MCI_IDR, 0xffffffff);
 +		WR4(sc, MCI_IER, ier);
  		return;
  	}
 +
 +	/* There is data, set up the transfer-related parts of the command. */
 +
  	if (data->flags & MMC_DATA_READ)
  		cmdr |= MCI_CMDR_TRDIR;
 +
  	if (data->flags & (MMC_DATA_READ | MMC_DATA_WRITE))
  		cmdr |= MCI_CMDR_TRCMD_START;
 +
  	if (data->flags & MMC_DATA_STREAM)
  		cmdr |= MCI_CMDR_TRTYP_STREAM;
 -	if (data->flags & MMC_DATA_MULTI)
 +	else if (data->flags & MMC_DATA_MULTI) {
  		cmdr |= MCI_CMDR_TRTYP_MULTIPLE;
 -	// Set block size and turn on PDC mode for dma xfer and disable
 -	// PDC until we're ready.
 -	mr = RD4(sc, MCI_MR) & ~MCI_MR_BLKLEN;
 -	WR4(sc, MCI_MR, mr | (data->len << 16) | MCI_MR_PDCMODE);
 -	WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS);
 -	if (cmdr & MCI_CMDR_TRCMD_START) {
 -		len = data->len;
 -		if (cmdr & MCI_CMDR_TRDIR)
 -			vaddr = cmd->data->data;
 -		else {
 -			/* Use bounce buffer even if we don't need
 -			 * byteswap, since buffer may straddle a page
 -			 * boundry, and we don't handle multi-segment
 -			 * transfers in hardware.
 -			 * (page issues seen from 'bsdlabel -w' which
 -			 * uses raw geom access to the volume).
 -			 * Greg Ansley (gja (at) ansley.com)
 -			 */
 -			vaddr = sc->bounce_buffer;
 -			src = (uint32_t *)cmd->data->data;
 -			dst = (uint32_t *)vaddr;
 +		sc->flags |= (data->flags & MMC_DATA_READ) ? 
 +				CMD_MULTIREAD : CMD_MULTIWRITE;
 +	}
 +
 +	/*
 +	 * Disable PDC until we're ready.
 +	 *
 +	 * Set block size and turn on PDC mode for dma xfer.
 +	 * Note that the block size is the smaller of the amount of data to be
 +	 * transferred, or 512 bytes.  The 512 size is fixed by the standard;
 +	 * smaller blocks are possible, but never larger.
 +	 */
 +
 +	WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS); 
 +
 +	mr = RD4(sc,MCI_MR) & ~MCI_MR_BLKLEN; 
 +	mr |=  min(data->len, 512) << 16; 
 +	WR4(sc, MCI_MR, mr | MCI_MR_PDCMODE|MCI_MR_PDCPADV);
 +
 +	/*
 +	 * Set up DMA.
 +	 *
 +	 * Use bounce buffers even if we don't need to byteswap, because doing
 +	 * multi-block IO with large DMA buffers is way fast (compared to
 +	 * single-block IO), even after incurring the overhead of also copying
 +	 * from/to the caller's buffers (which may be in non-contiguous physical
 +	 * pages).
 +	 *
 +	 * In an ideal non-byteswap world we could create a dma tag that allows
 +	 * for discontiguous segments and do the IO directly from/to the
 +	 * caller's buffer(s), using ENDRX/ENDTX interrupts to chain the
 +	 * discontiguous buffers through the PDC. Someday.
 +	 *
 +	 * If a read is bigger than 2k, split it in half so that we can start
 +	 * byte-swapping the first half while the second half is on the wire.
 +	 * It would be best if we could split it into 8k chunks, but we can't
 +	 * always keep up with the byte-swapping due to other system activity,
 +	 * and if an RXBUFF interrupt happens while we're still handling the
 +	 * byte-swap from the prior buffer (IE, we haven't returned from
 +	 * handling the prior interrupt yet), then data will get dropped on the
 +	 * floor and we can't easily recover from that.  The right fix for that
 +	 * would be to have the interrupt handling only keep the DMA flowing and
 +	 * enqueue filled buffers to be byte-swapped in a non-interrupt context.
 +	 * Even that won't work on the write side of things though; in that
 +	 * context we have to have all the data ready to go before starting the
 +	 * dma.
 +	 *
 +	 * XXX what about stream transfers?
 +	 */
 +	sc->xfer_offset = 0;
 +	sc->bbuf_curidx = 0;
 +
 +	if (data->flags & (MMC_DATA_READ | MMC_DATA_WRITE)) {
 +		uint32_t len;
 +		uint32_t remaining = data->len;
 +		bus_addr_t paddr;
 +		int err;
 +
 +		if (remaining > (BBCOUNT*BBSIZE))
 +			panic("IO read size exceeds MAXDATA\n");
 +
 +		if (data->flags & MMC_DATA_READ) {
 +			if (remaining > 2048) // XXX
 +				len = remaining / 2;
 +			else
 +				len = remaining;
 +			err = bus_dmamap_load(sc->dmatag, sc->bbuf_map[0], 
 +			    sc->bbuf_vaddr[0], len, at91_mci_getaddr, 
 +			    &paddr, BUS_DMA_NOWAIT);
 +			if (err != 0)
 +				panic("IO read dmamap_load failed\n");
 +			bus_dmamap_sync(sc->dmatag, sc->bbuf_map[0], 
 +			    BUS_DMASYNC_PREREAD);
 +			WR4(sc, PDC_RPR, paddr);
 +			WR4(sc, PDC_RCR, len / 4);
 +			sc->bbuf_len[0] = len;
 +			remaining -= len;
 +			if (remaining == 0) {
 +				sc->bbuf_len[1] = 0;
 +			} else {
 +				len = remaining;
 +				err = bus_dmamap_load(sc->dmatag, sc->bbuf_map[1], 
 +				    sc->bbuf_vaddr[1], len, at91_mci_getaddr, 
 +				    &paddr, BUS_DMA_NOWAIT);
 +				if (err != 0)
 +					panic("IO read dmamap_load failed\n");
 +				bus_dmamap_sync(sc->dmatag, sc->bbuf_map[1], 
 +				    BUS_DMASYNC_PREREAD);
 +				WR4(sc, PDC_RNPR, paddr);
 +				WR4(sc, PDC_RNCR, len / 4);
 +				sc->bbuf_len[1] = len;
 +				remaining -= len;
 +			}
 +			WR4(sc, PDC_PTCR, PDC_PTCR_RXTEN);
 +		} else {
 +			len = min(BBSIZE, remaining);
  			/*
  			 * If this is MCI1 revision 2xx controller, apply
  			 * a work-around for the "Data Write Operation and
 @@ -530,74 +769,75 @@ at91_mci_start_cmd(struct at91_mci_softc
  			 */
  			if (at91_mci_is_mci1rev2xx() && data->len < 12) {
  				len = 12;
 -				memset(dst, 0, 12);
 +				memset(data->data, 0, 12);
  			}
 -			if (sc->sc_cap & CAP_NEEDS_BYTESWAP) {
 -				for (i = 0; i < data->len / 4; i++)
 -					dst[i] = bswap32(src[i]);
 -			} else
 -				memcpy(dst, src, data->len);
 -		}
 -		data->xfer_len = 0;
 -		if (bus_dmamap_load(sc->dmatag, sc->map, vaddr, len,
 -		    at91_mci_getaddr, &paddr, 0) != 0) {
 -			cmd->error = MMC_ERR_NO_MEMORY;
 -			sc->req = NULL;
 -			sc->curcmd = NULL;
 -			cmd->mrq->done(cmd->mrq);
 -			return;
 -		}
 -		sc->mapped++;
 -		if (cmdr & MCI_CMDR_TRDIR) {
 -			bus_dmamap_sync(sc->dmatag, sc->map, BUS_DMASYNC_PREREAD);
 -			WR4(sc, PDC_RPR, paddr);
 -			WR4(sc, PDC_RCR, len / 4);
 -			ier = MCI_SR_ENDRX;
 -		} else {
 -			bus_dmamap_sync(sc->dmatag, sc->map, BUS_DMASYNC_PREWRITE);
 -			WR4(sc, PDC_TPR, paddr);
 +			at91_bswap_buf(sc, sc->bbuf_vaddr[0], data->data, len);
 +			err = bus_dmamap_load(sc->dmatag, sc->bbuf_map[0], 
 +			    sc->bbuf_vaddr[0], len, at91_mci_getaddr, 
 +			    &paddr, BUS_DMA_NOWAIT);
 +			if (err != 0)
 +				panic("IO write dmamap_load failed\n");
 +			bus_dmamap_sync(sc->dmatag, sc->bbuf_map[0], 
 +			    BUS_DMASYNC_PREWRITE);
 +			WR4(sc, PDC_TPR,paddr);
  			WR4(sc, PDC_TCR, len / 4);
 -			ier = MCI_SR_TXBUFE;
 +			sc->bbuf_len[0] = len;
 +			remaining -= len;
 +			if (remaining == 0) {
 +				sc->bbuf_len[1] = 0;
 +			} else {
 +				len = remaining;
 +				at91_bswap_buf(sc, sc->bbuf_vaddr[1],
 +				    ((char *)data->data)+BBSIZE, len);
 +				err = bus_dmamap_load(sc->dmatag, sc->bbuf_map[1], 
 +				    sc->bbuf_vaddr[1], len, at91_mci_getaddr, 
 +				    &paddr, BUS_DMA_NOWAIT);
 +				if (err != 0)
 +					panic("IO write dmamap_load failed\n");
 +				bus_dmamap_sync(sc->dmatag, sc->bbuf_map[1], 
 +				    BUS_DMASYNC_PREWRITE);
 +				WR4(sc, PDC_TNPR, paddr);
 +				WR4(sc, PDC_TNCR, len / 4);
 +				sc->bbuf_len[1] = len;
 +				remaining -= len;
 +			}
 +			/* do not enable PDC xfer until CMDRDY asserted */
  		}
 +		data->xfer_len = 0; /* XXX what's this? appears to be unused. */
  	}
 -//	printf("CMDR %x ARGR %x with data\n", cmdr, cmd->arg);
 +
 +	if (mci_debug)
 +		printf("CMDR %x (opcode %d) ARGR %x with data len %d\n", 
 +		       cmdr, cmd->opcode, cmd->arg, cmd->data->len);
 +
  	WR4(sc, MCI_ARGR, cmd->arg);
 -	if (cmdr & MCI_CMDR_TRCMD_START) {
 -		if (cmdr & MCI_CMDR_TRDIR) {
 -			WR4(sc, PDC_PTCR, PDC_PTCR_RXTEN);
 -			WR4(sc, MCI_CMDR, cmdr);
 -		} else {
 -			WR4(sc, MCI_CMDR, cmdr);
 -			WR4(sc, PDC_PTCR, PDC_PTCR_TXTEN);
 -		}
 -	}
 -	WR4(sc, MCI_IER, MCI_SR_ERROR | ier);
 +	WR4(sc, MCI_CMDR, cmdr);
 +	WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_CMDRDY);
  }
  
  static void
 -at91_mci_start(struct at91_mci_softc *sc)
 +at91_mci_next_operation(struct at91_mci_softc *sc)
  {
  	struct mmc_request *req;
  
  	req = sc->req;
  	if (req == NULL)
  		return;
 -	// assert locked
 -	if (!(sc->flags & CMD_STARTED)) {
 -		sc->flags |= CMD_STARTED;
 -//		printf("Starting CMD\n");
 +
 +	if (sc->flags & PENDING_CMD) {
 +		sc->flags &= ~PENDING_CMD;
  		at91_mci_start_cmd(sc, req->cmd);
  		return;
 -	}
 -	if (!(sc->flags & STOP_STARTED) && req->stop) {
 -//		printf("Starting Stop\n");
 -		sc->flags |= STOP_STARTED;
 +	} else if (sc->flags & PENDING_STOP) {
 +		sc->flags &= ~PENDING_STOP;
  		at91_mci_start_cmd(sc, req->stop);
  		return;
  	}
 -	/* We must be done -- bad idea to do this while locked? */
 +
 +	WR4(sc, MCI_IDR, 0xffffffff);
  	sc->req = NULL;
  	sc->curcmd = NULL;
 +	//printf("req done\n");
  	req->done(req);
  }
  
 @@ -607,16 +847,16 @@ at91_mci_request(device_t brdev, device_
  	struct at91_mci_softc *sc = device_get_softc(brdev);
  
  	AT91_MCI_LOCK(sc);
 -	// XXX do we want to be able to queue up multiple commands?
 -	// XXX sounds like a good idea, but all protocols are sync, so
 -	// XXX maybe the idea is naive...
  	if (sc->req != NULL) {
  		AT91_MCI_UNLOCK(sc);
  		return (EBUSY);
  	}
 +	//printf("new req\n");
  	sc->req = req;
 -	sc->flags = 0;
 -	at91_mci_start(sc);
 +	sc->flags = PENDING_CMD;
 +	if (sc->req->stop)
 +		sc->flags |= PENDING_STOP;
 +	at91_mci_next_operation(sc);
  	AT91_MCI_UNLOCK(sc);
  	return (0);
  }
 @@ -654,120 +894,351 @@ at91_mci_release_host(device_t brdev, de
  }
  
  static void
 -at91_mci_read_done(struct at91_mci_softc *sc)
 +at91_mci_read_done(struct at91_mci_softc *sc, uint32_t sr)
  {
 -	uint32_t *walker;
 -	struct mmc_command *cmd;
 -	int i, len;
 -
 -	cmd = sc->curcmd;
 -	bus_dmamap_sync(sc->dmatag, sc->map, BUS_DMASYNC_POSTREAD);
 -	bus_dmamap_unload(sc->dmatag, sc->map);
 -	sc->mapped--;
 -	if (sc->sc_cap & CAP_NEEDS_BYTESWAP) {
 -		walker = (uint32_t *)cmd->data->data;
 -		len = cmd->data->len / 4;
 -		for (i = 0; i < len; i++)
 -			walker[i] = bswap32(walker[i]);
 -	}
 -	// Finish up the sequence...
 -	WR4(sc, MCI_IDR, MCI_SR_ENDRX);
 -	WR4(sc, MCI_IER, MCI_SR_RXBUFF);
 -	WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS);
 +	struct mmc_command *cmd = sc->curcmd;
 +	char * dataptr = (char *)cmd->data->data;
 +	uint32_t curidx = sc->bbuf_curidx;
 +	uint32_t len = sc->bbuf_len[curidx];
 +
 +	/*
 +	 * We arrive here when a DMA transfer for a read is done, whether it's
 +	 * a single or multi-block read.
 +	 *
 +	 * We byte-swap the buffer that just completed, and if that is the
 +	 * last buffer that's part of this read then we move on to the next
 +	 * operation, otherwise we wait for another ENDRX for the next buffer.
 +	 */
 +
 +	bus_dmamap_sync(sc->dmatag, sc->bbuf_map[curidx], BUS_DMASYNC_POSTREAD);
 +	bus_dmamap_unload(sc->dmatag, sc->bbuf_map[curidx]);
 +
 +	at91_bswap_buf(sc, dataptr + sc->xfer_offset, sc->bbuf_vaddr[curidx], len);
 +
 +	if (mci_debug) {
 +		printf("read done sr %x curidx %d len %d xfer_offset %d\n",
 +		       sr, curidx, len, sc->xfer_offset);
 +	}
 +
 +	sc->xfer_offset += len;
 +	sc->bbuf_curidx = !curidx; /* swap buffers */
 +
 +	/*
 +	 * If we've transferred all the data, move on to the next operation.
 +	 *
 +	 * If we're still transferring the last buffer, RNCR is already zero but
 +	 * we have to write a zero anyway to clear the ENDRX status so we don't
 +	 * re-interrupt until the last buffer is done.
 +	 */
 +	if (sc->xfer_offset == cmd->data->len) {
 +		WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS);
 +		cmd->error = MMC_ERR_NONE;
 +		at91_mci_next_operation(sc);
 +	} else {
 +		WR4(sc, PDC_RNCR, 0);
 +		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_ENDRX);
 +	}
  }
  
  static void
 -at91_mci_xmit_done(struct at91_mci_softc *sc)
 +at91_mci_write_done(struct at91_mci_softc *sc, uint32_t sr)
  {
 -	// Finish up the sequence...
 +	struct mmc_command *cmd = sc->curcmd;
 +
 +	/*
 +	 * We arrive here when the entire DMA transfer for a write is done,
 +	 * whether it's a single or multi-block write.  If it's multi-block we
 +	 * have to immediately move on to the next operation which is to send
 +	 * the stop command.  If it's a single-block transfer we need to wait
 +	 * for NOTBUSY, but if that's already asserted we can avoid another
 +	 * interrupt and just move on to completing the request right away.
 +	 */
 +
  	WR4(sc, PDC_PTCR, PDC_PTCR_RXTDIS | PDC_PTCR_TXTDIS);
 -	WR4(sc, MCI_IDR, MCI_SR_TXBUFE);
 -	WR4(sc, MCI_IER, MCI_SR_NOTBUSY);
 -	bus_dmamap_sync(sc->dmatag, sc->map, BUS_DMASYNC_POSTWRITE);
 -	bus_dmamap_unload(sc->dmatag, sc->map);
 -	sc->mapped--;
 +
 +	bus_dmamap_sync(sc->dmatag, sc->bbuf_map[sc->bbuf_curidx],
 +	    BUS_DMASYNC_POSTWRITE);
 +	bus_dmamap_unload(sc->dmatag, sc->bbuf_map[sc->bbuf_curidx]);
 +
 +	if ((cmd->data->flags & MMC_DATA_MULTI) || (sr & MCI_SR_NOTBUSY)) {
 +		cmd->error = MMC_ERR_NONE;
 +		at91_mci_next_operation(sc);
 +	} else {
 +		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_NOTBUSY);
 +	}
 +}
 +
 +static void
 +at91_mci_notbusy(struct at91_mci_softc *sc)
 +{
 +	struct mmc_command *cmd = sc->curcmd;
 +
 +	/*
 +	 * We arrive here by either completion of a single-block write, or
 +	 * completion of the stop command that ended a multi-block write (and,
 +	 * I suppose, after a card-select or erase, but I haven't tested
 +	 * those).  Anyway, we're done and it's time to move on to the next
 +	 * command.
 +	 */
 +
 +	cmd->error = MMC_ERR_NONE;
 +	at91_mci_next_operation(sc);
 +}
 +
 +static void
 +at91_mci_stop_done(struct at91_mci_softc *sc, uint32_t sr)
 +{
 +	struct mmc_command *cmd = sc->curcmd;
 +
 +	/*
 +	 * We arrive here after receiving CMDRDY for a MMC_STOP_TRANSMISSION
 +	 * command.  Depending on the operation being stopped, we may have to
 +	 * do some unusual things to work around hardware bugs.
 +	 */
 +
 +	/*
 +	 * This is known to be true of at91rm9200 hardware; it may or may not
 +	 * apply to more recent chips: 
 +	 *
 +	 * After stopping a multi-block write, the NOTBUSY bit in MCI_SR does
 +	 * not properly reflect the actual busy state of the card as signaled
 +	 * on the DAT0 line; it always claims the card is not-busy.  If we
 +	 * believe that and let operations continue, following commands will
 +	 * fail with response timeouts (except of course MMC_SEND_STATUS -- it
 +	 * indicates the card is busy in the PRG state, which was the smoking
 +	 * gun that showed MCI_SR NOTBUSY was not tracking DAT0 correctly).
 +	 *
 +	 * The atmel docs are emphatic: "This flag [NOTBUSY] must be used only
 +	 * for Write Operations."  I guess technically since we sent a stop
 +	 * it's not a write operation anymore.  But then just what did they
 +	 * think it meant for the stop command to have "...an optional busy
 +	 * signal transmitted on the data line" according to the SD spec?
 +	 *
 +	 * I tried a variety of things to un-wedge the MCI and get the status
 +	 * register to reflect NOTBUSY correctly again, but the only thing
 +	 * that worked was a full device reset.  It feels like an awfully big
 +	 * hammer, but doing a full reset after every multiblock write is
 +	 * still faster than doing single-block IO (by almost two orders of
 +	 * magnitude: 20KB/sec improves to about 1.8MB/sec best case).
 +	 *
 +	 * After doing the reset, wait for a NOTBUSY interrupt before
 +	 * continuing with the next operation.
 +	 */
 +	if (sc->flags & CMD_MULTIWRITE) {
 +		at91_mci_reset(sc);
 +		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_NOTBUSY);
 +		return;
 +	}
 +
 +	/*
 +	 * This is known to be true of at91rm9200 hardware; it may or may not
 +	 * apply to more recent chips:
 +	 *
 +	 * After stopping a multi-block read, loop to read and discard any
 +	 * data that coasts in after we sent the stop command.  The docs don't
 +	 * say anything about it, but empirical testing shows that 1-3
 +	 * additional words of data get buffered up in some unmentioned
 +	 * internal fifo and if we don't read and discard them here they end
 +	 * up on the front of the next read DMA transfer we do.
 +	 */
 +	if (sc->flags & CMD_MULTIREAD) {
 +		uint32_t sr;
 +		int count = 0;
 +
 +		do {
 +			sr = RD4(sc, MCI_SR);
 +			if (sr & MCI_SR_RXRDY) {
 +				RD4(sc,  MCI_RDR);
 +				++count;
 +			}
 +		} while (sr & MCI_SR_RXRDY);
 +		at91_mci_reset(sc);
 +//              if (count != 0)
 +//                      printf("Had to soak up %d words after read\n", count);
 +	}
 +
 +	cmd->error = MMC_ERR_NONE;
 +	at91_mci_next_operation(sc);
 +
 +}
 +
 +static void
 +at91_mci_cmdrdy(struct at91_mci_softc *sc, uint32_t sr)
 +{
 +	struct mmc_command *cmd = sc->curcmd;
 +	int i;
 +
 +	if (cmd == NULL)
 +		return;
 +
 +	/*
 +	 * We get here at the end of EVERY command.  We retrieve the command
 +	 * response (if any) then decide what to do next based on the command.
 +	 */
 +
 +	if (cmd->flags & MMC_RSP_PRESENT) {
 +		for (i = 0; i < ((cmd->flags & MMC_RSP_136) ? 4 : 1); i++) {
 +			cmd->resp[i] = RD4(sc, MCI_RSPR + i * 4);
 +			if (mci_debug)
 +				printf("RSPR[%d] = %x sr=%x\n", i, cmd->resp[i],  sr);
 +		}
 +	}
 +
 +	/*
 +	 * If this was a stop command, go handle the various special
 +	 * conditions (read: bugs) that have to be dealt with following a stop.
 +	 */
 +	if (cmd->opcode == MMC_STOP_TRANSMISSION) {
 +		at91_mci_stop_done(sc, sr);
 +		return;
 +	}
 +
 +	/*
 +	 * If this command can continue to assert BUSY beyond the response then
 +	 * we need to wait for NOTBUSY before the command is really done.
 +	 *
 +	 * Note that this may not work properly on the at91rm9200.  It certainly
 +	 * doesn't work for the STOP command that follows a multi-block write,
 +	 * so post-stop CMDRDY is handled separately; see the special handling
 +	 * in at91_mci_stop_done().
 +	 *
 +	 * Beside STOP, there are other R1B-type commands that use the busy
 +	 * signal after CMDRDY: CMD7 (card select), CMD28-29 (write protect),
 +	 * CMD38 (erase). I haven't tested any of them, but I rather expect
 +	 * them all to have the same sort of problem with MCI_SR not actually
 +	 * reflecting the state of the DAT0-line busy indicator.  So this code
 +	 * may need to grow some sort of special handling for them too. (This
 +	 * just in: CMD7 isn't a problem right now because dev/mmc.c incorrectly
 +	 * sets the response flags to R1 rather than R1B.) XXX
 +	 */
 +	if ((cmd->flags & MMC_RSP_BUSY)) {
 +		WR4(sc, MCI_IER, MCI_SR_ERROR | MCI_SR_NOTBUSY);
 +		return;
 +	}
 +
 +	/*
 +	 * If there is a data transfer with this command, then...
 +	 * - If it's a read, we need to wait for ENDRX.
 +	 * - If it's a write, now is the time to enable the PDC, and we need
 +	 *   to wait for a BLKE that follows a TXBUFE, because if we're doing
 +	 *   a split transfer we get a BLKE after the first half (when TPR/TCR
 +	 *   get loaded from TNPR/TNCR).  So first we wait for the TXBUFE, and
 +	 *   the handling for that interrupt will then invoke the wait for the
 +	 *   subsequent BLKE which indicates actual completion.
 +	 */
 +	if (cmd->data) {
 +		uint32_t ier;
 +		if (cmd->data->flags & MMC_DATA_READ) {
 +			ier = MCI_SR_ENDRX;
 +		} else {
 +			ier = MCI_SR_TXBUFE;
 +			WR4(sc, PDC_PTCR, PDC_PTCR_TXTEN);
 +		}
 +		WR4(sc, MCI_IER, MCI_SR_ERROR | ier);
 +		return;
 +	}
 +
 +	/*
 +	 * If we made it to here, we don't need to wait for anything more for
 +	 * the current command, move on to the next command (will complete the
 +	 * request if there is no next command).
 +	 */
 +	cmd->error = MMC_ERR_NONE;
 +	at91_mci_next_operation(sc);
  }
  
  static void
  at91_mci_intr(void *arg)
  {
  	struct at91_mci_softc *sc = (struct at91_mci_softc*)arg;
 -	uint32_t sr;
 -	int i, done = 0;
 -	struct mmc_command *cmd;
 +	struct mmc_command *cmd = sc->curcmd;
 +	uint32_t sr, isr;
  
  	AT91_MCI_LOCK(sc);
 -	sr = RD4(sc, MCI_SR) & RD4(sc, MCI_IMR);
 -//	printf("i 0x%x\n", sr);
 -	cmd = sc->curcmd;
 -	if (sr & MCI_SR_ERROR) {
 -		// Ignore CRC errors on CMD2 and ACMD47, per relevant standards
 -		if ((sr & MCI_SR_RCRCE) && (cmd->opcode == MMC_SEND_OP_COND ||
 -		    cmd->opcode == ACMD_SD_SEND_OP_COND))
 -			cmd->error = MMC_ERR_NONE;
 -		else if (sr & (MCI_SR_RTOE | MCI_SR_DTOE))
 +
 +	sr = RD4(sc, MCI_SR);
 +	isr = sr & RD4(sc, MCI_IMR);
 +
 +	if (mci_debug)
 +		printf("i 0x%x sr 0x%x\n", isr, sr);
 +
 +	/*
 +	 * All interrupts are one-shot; disable it now.
 +	 * The next operation will re-enable whatever interrupts it wants.
 +	 */
 +	WR4(sc, MCI_IDR, isr);
 +	if (isr & MCI_SR_ERROR) {
 +		if (isr & (MCI_SR_RTOE | MCI_SR_DTOE))
  			cmd->error = MMC_ERR_TIMEOUT;
 -		else if (sr & (MCI_SR_RCRCE | MCI_SR_DCRCE))
 +		else if (isr & (MCI_SR_RCRCE | MCI_SR_DCRCE))
  			cmd->error = MMC_ERR_BADCRC;
 -		else if (sr & (MCI_SR_OVRE | MCI_SR_UNRE))
 +		else if (isr & (MCI_SR_OVRE | MCI_SR_UNRE))
  			cmd->error = MMC_ERR_FIFO;
  		else
  			cmd->error = MMC_ERR_FAILED;
 -		done = 1;
 -		if (sc->mapped && cmd->error) {
 -			bus_dmamap_unload(sc->dmatag, sc->map);
 -			sc->mapped--;
 +		/*
 +		 * CMD8 is used to probe for SDHC cards, a standard SD card
 +		 * will get a response timeout; don't report it because it's a
 +		 * normal and expected condition.  One might argue that all
 +		 * error reporting should be left to higher levels, but when
 +		 * they report at all it's always EIO, which isn't very
 +		 * helpful. XXX bootverbose?
 +		 */
 +		if (cmd->opcode != 8) {
 +			device_printf(sc->dev, 
 
 *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
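
For readers following the truncated diff above, here is a minimal, self-contained sketch of the two ideas its comments describe: choosing which event to wait for after the command completes, and treating every interrupt as one-shot by masking it in the handler. It is not the driver code. Every sk_ name, the bit values, and the simulated register struct are assumptions made up for illustration; only the ordering of the checks follows the patch. The write path is simplified to its first stage (wait for TXBUFE); the follow-up wait for BLKE is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative status bits. Placeholder values, NOT the real MCI_SR layout. */
#define SK_SR_NOTBUSY (1u << 0)
#define SK_SR_ENDRX   (1u << 1)
#define SK_SR_TXBUFE  (1u << 2)
#define SK_SR_RTOE    (1u << 3)	/* response timeout */
#define SK_SR_DTOE    (1u << 4)	/* data timeout */
#define SK_SR_RCRCE   (1u << 5)	/* response CRC error */
#define SK_SR_DCRCE   (1u << 6)	/* data CRC error */
#define SK_SR_OVRE    (1u << 7)	/* overrun */
#define SK_SR_UNRE    (1u << 8)	/* underrun */
#define SK_SR_RENDE   (1u << 9)	/* some other error bit */
#define SK_SR_ERROR   (SK_SR_RTOE | SK_SR_DTOE | SK_SR_RCRCE | SK_SR_DCRCE | \
			SK_SR_OVRE | SK_SR_UNRE | SK_SR_RENDE)

enum sk_err { SK_ERR_NONE, SK_ERR_TIMEOUT, SK_ERR_BADCRC, SK_ERR_FIFO, SK_ERR_FAILED };

/* Simulated controller: raw status plus the interrupt mask. */
struct sk_mci {
	uint32_t sr;
	uint32_t imr;
};

/* Stand-ins for writes to the enable and disable registers. */
static void sk_enable(struct sk_mci *m, uint32_t bits)  { m->imr |= bits; }
static void sk_disable(struct sk_mci *m, uint32_t bits) { m->imr &= ~bits; }

/*
 * Pick the interrupts to enable once the command itself is done.
 * Returns 0 when there is nothing further to wait for.
 */
static uint32_t
sk_wait_bits(int rsp_busy, int has_data, int is_read)
{
	if (rsp_busy)
		return (SK_SR_ERROR | SK_SR_NOTBUSY);
	if (has_data)
		return (SK_SR_ERROR | (is_read ? SK_SR_ENDRX : SK_SR_TXBUFE));
	return (0);
}

/* Map error bits to an error code, checking in the same order as the patch. */
static enum sk_err
sk_classify(uint32_t isr)
{
	if ((isr & SK_SR_ERROR) == 0)
		return (SK_ERR_NONE);
	if (isr & (SK_SR_RTOE | SK_SR_DTOE))
		return (SK_ERR_TIMEOUT);
	if (isr & (SK_SR_RCRCE | SK_SR_DCRCE))
		return (SK_ERR_BADCRC);
	if (isr & (SK_SR_OVRE | SK_SR_UNRE))
		return (SK_ERR_FIFO);
	return (SK_ERR_FAILED);
}

/*
 * One-shot handler: only status bits that are currently enabled count,
 * and those bits are disabled before anything else is done with them.
 */
static uint32_t
sk_intr(struct sk_mci *m, enum sk_err *err)
{
	uint32_t isr;

	assert(m != NULL && err != NULL);
	isr = m->sr & m->imr;
	sk_disable(m, isr);
	*err = sk_classify(isr);
	return (isr);
}
```

In this sketch a status bit that is set but not enabled is ignored by the handler, and a bit that has fired stays masked until the next operation enables it again, which is the behaviour the "one-shot" comment in the patch describes.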
 
State-Changed-From-To: open->patched 
State-Changed-By: imp 
State-Changed-When: Mon Aug 27 21:51:28 MDT 2012 
State-Changed-Why:  
Committed. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=155214 
State-Changed-From-To: patched->closed 
State-Changed-By: imp 
State-Changed-When: Mon Feb 24 07:39:42 MST 2014 
State-Changed-Why:  
This bug has been fixed in 10. 


http://www.freebsd.org/cgi/query-pr.cgi?pr=155214 
>Unformatted:
