Serial DMA Routines for the NXP Kinetis MKE16F512
Published by Tim Allen on 17th July 2021


One of the core software modules that we need for any embedded software project is interrupt-based serial I/O. With that, we can support putchar() and getchar(), which form the basis of a minimal shell and allow debugging and diagnostic information to be output with printf() statements. Transmit and receive ring buffers decouple the application from the serial interface.

High Baud Rates and FIFOs

In the old days we ran our debug serial ports at 9600 or 19200 baud. Nowadays we find 115200 baud more convenient: it allows a reasonable amount of diagnostic output (assuming some gaps in the output) without filling the transmit buffer, at which point putchar() has to wait for space to appear in the buffer. Each character takes 87µs to transmit or receive (8 data bits plus one start and one stop bit), and one or two short interrupt routines every 87µs is not too bad for an embedded processor (we tend to use ARM Cortex M0+ or M4 for many of our projects).

However, we recently had a requirement to run several serial ports at 921600 baud. For a single port, this shrinks the interval between interrupts to as little as 10.9µs in each direction, which is unacceptably short, and with additional ports things only get worse. Apart from the risk of losing incoming characters if a higher-priority interrupt delays the receive interrupt, a significant amount of processor time would be spent in the transmit and receive interrupt routines. The effects could be mitigated to some extent by assigning a high priority to the receive interrupt and a low priority to the transmit interrupt, but the problem of excessive processor load remains.
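The figures quoted above follow from simple arithmetic, sketched below in C. The function names are made up for this illustration, and the interrupt count assumes the worst case of one interrupt per character in each direction on every port.

```c
#include <assert.h>
#include <stdint.h>

/* Time on the wire for one character, in nanoseconds (rounded to nearest).
   bits_per_char counts the start, data and stop bits: 10 for 8N1. */
uint32_t char_time_ns(uint32_t baud, uint32_t bits_per_char)
{
    assert(baud > 0);
    return (uint32_t)(((uint64_t)bits_per_char * 1000000000u + baud / 2u) / baud);
}

/* Worst-case interrupts per second for `ports` ports, assuming one
   interrupt per character in each direction (receive and transmit). */
uint32_t irq_per_second(uint32_t baud, uint32_t bits_per_char, uint32_t ports)
{
    assert(bits_per_char > 0);
    return 2u * ports * (baud / bits_per_char);
}
```

With 10 bits per character, char_time_ns() gives about 86.8µs at 115200 baud and about 10.9µs at 921600 baud. A single port at 921600 baud, busy in both directions, could need over 180,000 interrupts per second.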

Our first thought was to make use of the UART FIFOs which are included on our processor (an MKE16F512 by NXP with an ARM Cortex M4 core). In an ideal scenario, the FIFOs could do the job of the ring buffers (which are, after all, just FIFOs implemented in software), provided they were large enough. Things looked promising, with the processor reference manual showing the relevant FIFO configuration registers covering FIFO sizes up to 256 bytes, which would just about be acceptable. However, closer scrutiny showed that these are read-only register bits, pre-configured for a FIFO size of 4 bytes! It’s always been a bit of a mystery why microcontrollers very often have no UART FIFOs or, if they do, small ones of only a few bytes. Even the venerable 16550 UART of the late 1980s had 16-byte FIFOs.

Four-byte FIFOs could help a bit in reducing processor load if used in conjunction with larger ring buffers, but the payback didn’t seem worthwhile. It was time to look into using the DMA controller on the processor.

MKE16F512 eDMA Controller

The MKE16F512’s Enhanced DMA (eDMA) controller supports a modulo function, so it looked entirely reasonable that it should be possible to implement a serial comms module based on DMA rather than interrupts. The DMA controller would co-operate with the UART module to feed bytes from a transmit ring buffer to the UART, and from the UART to a receive ring buffer. NXP provides an SDK, with a DMA module and a UART DMA module. However, the UART DMA module is geared to packet transfers, which most of the time is not what serial comms is used for. Both modules are also wrapped in quite a layer of hardware abstraction code. Searching the NXP forums, we weren’t able to find anything that was directly applicable.

The eDMA controller is a bit of a beast (including the chapter on the DMA Multiplexer it runs to just short of 100 pages in the reference manual). However, it has all the functionality to implement serial I/O with ring buffers.

Reception Routines

Referring to sci_edma.c, SciInit() configures the DMA multiplexer, the eDMA and the UART for receiving and transmitting. On the receiving side, once configured, it is simply a matter of calling SciGet() to read a character from the ring buffer - no further involvement is required from the processor code.

The DMAMUX simply maps a peripheral DMA request signal to a DMA channel. There are 16 such channels, and we are using two (one for reception and one for transmission). The UART is set up so that a DMA request is generated when a character arrives (via the RDMAE bit in the BAUD register). Note that the UART FIFOs are not used - the DMA is doing the work for us.

In this case our requirement is straightforward - on each DMA request, transfer the contents of the UART DATA register to the array used to implement the ring buffer, then increment the DMA destination address to point to the next ring buffer location.


The eDMA controller supports both minor and major loops. For reception, the minor loop is configured to transfer a single byte.


We don’t need any major looping, just one pass through the minor loop, so the major loop count is set to 1.


On each DMA request we are transferring a byte, so the SSIZE and DSIZE fields of DMA0->TCD[DMA_RX_CHAN].ATTR are set to 0. We also want to implement a ring buffer, so we enable the modulo function on the destination and specify the buffer size.
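The receive configuration described so far can be sketched on a host machine as follows. This is a minimal model under stated assumptions, not the code from sci_edma.c. The tcd_t struct is a stand-in for one transfer control descriptor, using the field names from the reference manual. The function names and the addresses in the usage note are hypothetical, and rx_next_daddr() only models how the modulo feature wraps the destination address.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for one eDMA transfer control descriptor (TCD). */
typedef struct {
    uint32_t SADDR;       int16_t  SOFF;  uint16_t ATTR;
    uint32_t NBYTES_MLNO; int32_t  SLAST;
    uint32_t DADDR;       int16_t  DOFF;  uint16_t CITER_ELINKNO;
    int32_t  DLAST_SGA;   uint16_t CSR;   uint16_t BITER_ELINKNO;
} tcd_t;

/* ATTR layout: SMOD[15:11] SSIZE[10:8] DMOD[7:3] DSIZE[2:0] */
#define ATTR_SSIZE(x) ((uint16_t)(((x) & 0x7u) << 8))
#define ATTR_DMOD(x)  ((uint16_t)(((x) & 0x1Fu) << 3))
#define ATTR_DSIZE(x) ((uint16_t)((x) & 0x7u))

/* log2 of a power-of-two buffer size: the value the DMOD field expects. */
unsigned log2_pow2(uint32_t n)
{
    unsigned k = 0;
    assert(n != 0 && (n & (n - 1u)) == 0);
    while ((n >>= 1) != 0)
        k++;
    return k;
}

/* Receive: fixed source (UART DATA), destination walks the ring buffer. */
void rx_tcd_setup(tcd_t *tcd, uint32_t uart_data_addr,
                  uint32_t buf_addr, uint32_t buf_size)
{
    /* The modulo feature needs the buffer aligned to its own size. */
    assert((buf_addr & (buf_size - 1u)) == 0);
    tcd->SADDR = uart_data_addr; tcd->SOFF = 0; tcd->SLAST = 0;
    tcd->DADDR = buf_addr;       tcd->DOFF = 1; tcd->DLAST_SGA = 0;
    tcd->NBYTES_MLNO   = 1;      /* 1-byte minor loop */
    tcd->CITER_ELINKNO = 1;      /* single pass: no major looping */
    tcd->BITER_ELINKNO = 1;
    tcd->ATTR = (uint16_t)(ATTR_SSIZE(0) | ATTR_DSIZE(0) |
                           ATTR_DMOD(log2_pow2(buf_size)));
    tcd->CSR = 0;
}

/* Model of the destination address after one transfer: only the low
   DMOD bits may change, so the address wraps within the buffer. */
uint32_t rx_next_daddr(const tcd_t *tcd)
{
    uint32_t mask = (1u << ((tcd->ATTR >> 3) & 0x1Fu)) - 1u;
    uint32_t next = tcd->DADDR + (uint32_t)tcd->DOFF;
    return (tcd->DADDR & ~mask) | (next & mask);
}
```

For a 256-byte buffer at a made-up address of 0x20000100, DMOD is 8, and a write to the last location at 0x200001FF is followed by a wrap back to 0x20000100. This also shows why the buffer must be a power of two in size and aligned to that size.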

Right at the end of the reference manual chapter on the eDMA is a “Usage Guide” section, which states that the BWC field of DMA0->TCD[DMA_TX_CHAN].CSR should be set to 0b10 when more than one channel is active. This stalls the eDMA engine for 4 cycles after each R/W. The detailed description of these bits implies that they have no effect in our setup, as our minor loops are only one byte long, but we set them anyway to err on the side of caution.

That’s pretty much it for reception. The DMA hardware request is enabled, the UART DMA signal is enabled further down in the initialisation code, and from then on the DMA controller transfers incoming characters to the ring buffer.

It’s worth noting that there is no mechanism to stop the ring buffer overflowing: if characters are not read from the ring buffer in a timely manner, incoming characters will eventually overwrite those already there. Our applications are designed to keep up with the maximum sustained rate of incoming characters; the ring buffer’s job is to absorb short bursts above that rate.

Transmission Routines

Transmission is a bit more involved. We need to set up an interrupt routine which is called on completion of a transmit DMA operation. Suppose a transmit DMA is already running (i.e. characters are being transferred under DMA from our transmit ring buffer to the UART). On completion of that transfer, the transmit DMA interrupt routine runs. In it, we check whether the ring buffer has any characters in it (i.e. ones which have arrived during the previous DMA operation). If it does, we start a new DMA transfer for the number of characters in the buffer. The process repeats until the transmit DMA interrupt routine finds there are no characters in the buffer, at which point it exits without starting a new DMA.

The initialisation code should be self-explanatory, as it is similar to the reception code, except that the source is now the ring buffer and the destination the UART. We also configure the channel to disable further DMA requests and generate an interrupt when the major loop completes.
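As a rough illustration of that last setting, the sketch below builds a CSR value with the DREQ and INTMAJOR bits set and models their effect when a major loop completes. The bit positions are those given in the reference manual's TCD CSR description. The dma_state_t struct and both function names are hypothetical stand-ins for this example, not part of any NXP header.

```c
#include <assert.h>
#include <stdint.h>

#define CSR_INTMAJOR (1u << 1)   /* interrupt when the major loop completes */
#define CSR_DREQ     (1u << 3)   /* clear the channel's ERQ bit on completion */

/* Simplified view of the controller: one bit per channel in each field. */
typedef struct {
    uint16_t erq;        /* hardware request enables */
    uint16_t int_flags;  /* pending channel interrupts */
} dma_state_t;

/* CSR value for the transmit channel as described in the text. */
uint16_t tx_csr_value(void)
{
    return (uint16_t)(CSR_DREQ | CSR_INTMAJOR);
}

/* Model of what the hardware does when a channel's major loop finishes. */
void sim_major_loop_done(dma_state_t *dma, unsigned chan, uint16_t csr)
{
    assert(chan < 16u);
    if (csr & CSR_DREQ)
        dma->erq = (uint16_t)(dma->erq & ~(1u << chan));
    if (csr & CSR_INTMAJOR)
        dma->int_flags = (uint16_t)(dma->int_flags | (1u << chan));
}
```

Disabling the request automatically is what stops the UART from pulling stale bytes out of the ring buffer once the queued characters have gone. The interrupt then gives the software a chance to decide whether another transfer is needed.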


This whole process has to be kicked off, and that is done in SciPut(), the function which is called to send a character by loading it into the transmit ring buffer. Having done this, it checks whether a transmit DMA transfer is currently running. If so, there is nothing more to do. If not, it starts a DMA transfer. In this scenario there can only be a single character in the buffer (the one just placed there), so this first DMA transfer can always be set for a count of one.


In the DMA completion interrupt routine, if there are still characters in the ring buffer, we start a DMA transfer for those characters:

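n = ((uint32_t)StSciTxInPtr >= DMA0->TCD[DMA_TX_CHAN].SADDR) ?
  (uint32_t)StSciTxInPtr - DMA0->TCD[DMA_TX_CHAN].SADDR :
  SCI_TX_BFR_SZ - (DMA0->TCD[DMA_TX_CHAN].SADDR - (uint32_t)StSciTxInPtr);
if (n) {
  /* Characters queued during the last transfer: send them as one block (register writes assumed) */
  DMA0->TCD[DMA_TX_CHAN].CITER_ELINKNO = n;
  DMA0->TCD[DMA_TX_CHAN].BITER_ELINKNO = n;
  DMA0->SERQ = DMA_TX_CHAN;  /* re-enable the channel's hardware request */
} else
  StTxDmaRunning = false;    /* buffer empty: SciPut() will start the next transfer */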

Note that we use a flag variable (StTxDmaRunning) to allow SciPut() to determine whether a DMA transfer is still running. It is not appropriate to use the DMA TCDn_CSR:DONE bit, as that could change midway through SciPut().

With this strategy, and assuming characters are being loaded into the FIFO at speed (say from a call to printf()), what typically happens is that a series of DMA transfers occurs, the first for a single character and successive ones for increasing numbers of characters.
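The pattern of growing transfers can be reproduced with a small host-side model of the transmit scheme. This is a sketch under assumptions: every name except the role of StTxDmaRunning is invented for the example, sim_dma_tx_request() stands in for one UART DMA request, and the interrupt-masking a real SciPut() would need around the flag check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TX_BFR_SZ 16u            /* hypothetical size, must be a power of two */
#define MAX_LOG   8u

static uint8_t  tx_bfr[TX_BFR_SZ];
static uint32_t tx_in;           /* where sci_put() stores the next character */
static uint32_t dma_src;         /* stands in for TCD SADDR, as a buffer offset */
static uint32_t dma_left;        /* stands in for the major loop count (CITER) */
static bool     tx_dma_running;  /* same role as StTxDmaRunning */

uint8_t  wire[64];               /* bytes "sent" by the simulated UART */
uint32_t wire_len;
uint32_t xfer_log[MAX_LOG];      /* length of each DMA transfer started */
uint32_t xfer_count;

static void start_dma(uint32_t n)
{
    assert(n > 0 && xfer_count < MAX_LOG);
    dma_left = n;
    xfer_log[xfer_count++] = n;
}

/* Completion interrupt: send whatever was queued in the meantime. */
static void tx_dma_isr(void)
{
    uint32_t n = (tx_in - dma_src) & (TX_BFR_SZ - 1u);
    if (n)
        start_dma(n);
    else
        tx_dma_running = false;
}

/* Queue one character (no full-buffer check in this sketch). */
void sci_put(uint8_t c)
{
    tx_bfr[tx_in] = c;
    tx_in = (tx_in + 1u) & (TX_BFR_SZ - 1u);
    if (!tx_dma_running) {       /* idle: only this character is pending */
        tx_dma_running = true;
        start_dma(1u);
    }
}

/* One UART DMA request: move a byte, run the ISR when the transfer ends. */
void sim_dma_tx_request(void)
{
    if (dma_left == 0)
        return;
    assert(wire_len < sizeof wire);
    wire[wire_len++] = tx_bfr[dma_src];
    dma_src = (dma_src + 1u) & (TX_BFR_SZ - 1u);
    if (--dma_left == 0)
        tx_dma_isr();
}
```

Suppose three characters are queued before the first byte leaves, and three more while the second transfer is in progress. The model then starts transfers of 1, 2 and 3 characters, and returns to idle once the buffer is empty.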


sci_edma.c and sci_edma.h contain the DMA-based drivers built on the techniques described above. They are written for the NXP MKE16F512, but should port readily to other NXP devices with the eDMA module.
