Demystifying the LFSR
Feb 13, 2022, Programming

Contents

* What is an LFSR?
* Questions
* Brute Force
* Addition in Binary
* Subtraction in Binary
* Multiplication in Binary
* Division in Binary
* LFSRs, Now with More Mathematics!
* Irreducible Polynomials
* Primitive Polynomials
* Results
* More Information

Note: This uses JavaScript to render math formulas and provide animated circuit diagrams.

Are linear feedback shift registers really that hard to understand? I think not! All you need to know to understand LFSRs is high-school level mathematics and an understanding of binary:

* You know basic algebra and you can work with polynomials like x^2 + x + 1.
* You know how to do long division--or at least, you vaguely remember what long division is, and you can muddle through it.
* You know what prime numbers are, and how any number can be factored into prime factors, like 20 = 2^2 \cdot 5.
* You know some digital logic, like what "XOR" means.
* If you know some group theory, great! Some of this will be familiar.

Deeper Mathematics

I'm not going to leave out the deeper mathematics entirely; discussions of more advanced topics are set apart in sections marked like this. You do not need to understand the "deeper mathematics" sections in order to understand LFSRs.

What is an LFSR?

A linear feedback shift register is a simple way to generate a sequence of pseudorandom bits by taking a register, shifting its contents by one bit, and mixing in some new data using an XOR operation. This is not a rigorous definition; I just want to get you thinking about how an LFSR actually works--either as a digital circuit or as software. LFSRs are often used as random number generators, especially in constrained software and hardware environments.
If you travel back in time to the 1980s or 1970s, the LFSR was an easy and reliable way to generate random numbers on 8-bit CPUs like the 8080 or 6502. If you're working in hardware, you can implement an LFSR with only a handful of gates and get a fast, reliable random number generator.

There are two different ways to set up an LFSR: the "Fibonacci LFSR" and the "Galois LFSR". Both versions work by taking certain bits from the shift register, XORing them together, and putting them back into the shift register. The two versions just work in opposite directions.

How to Pronounce "Fibonacci" and "Galois"

"Fibonacci" is pronounced:

* /ˌfɪbəˈnɑːtʃi/ (or /ˌfiːb-/) in English, "fi-buh-na-chee", and
* /fi.boˈnat.tʃi/ in Italian.

See YouTube: How to Pronounce Fibonacci

"Galois" is pronounced:

* /ɡælˈwɑː/ in English, "gal-waa", and
* /ɡaˈlwa/ in French.

See YouTube: How to Pronounce Evariste Galois

[Animated diagram: Fibonacci configuration LFSR]
[Animated diagram: Galois configuration LFSR]

These are both LFSRs. It turns out that the Galois configuration is easier to implement in software, and it has higher performance when implemented in hardware. The two configurations are equivalent, but for this article, I'm just going to focus on the Galois configuration and ignore the Fibonacci configuration.

Code

It is simple to implement an LFSR in code. I'm choosing to use the Galois configuration, and to shift the register left each step.

#include <stdint.h>

// Advance the LFSR by one step.
uint32_t lfsr_advance(uint32_t value, int width, uint32_t taps) {
  const uint32_t high_bit = (uint32_t)1 << width;
  value <<= 1;
  // If a 1 bit gets shifted out, then...
  if ((value & high_bit) != 0) {
    // Clear that bit, and flip the taps.
    value ^= high_bit | taps;
  }
  return value;
}

You'll often hard-code the width of the LFSR and the taps, because it's not something most people care to configure.

// Calculate the next step of a 16-bit LFSR with the taps at
// position 0, 2, 3, and 5.
// This LFSR has maximum period.
//
// These taps are 101101 in binary, or 0x2d in hexadecimal.
uint32_t my_lfsr_advance(uint32_t value) {
  return lfsr_advance(value, 16, 0x2d);
}

This works as a reasonable random number generator, although it only generates one bit at a time.

I chose to shift the LFSR to the left to make the explanation a little easier. You could just as easily shift to the right instead. It will produce the same stream of bits, in the same order, but the bits in the register will be numbered in reverse.

// Advance a 16-bit LFSR by one step, shifting to the right. This is
// the same 16-bit LFSR as above, with taps at position 0, 2, 3, and
// 5, but the taps are numbered in the opposite direction.
uint32_t right_lfsr_advance(uint32_t value) {
  uint32_t new_value = value >> 1;
  if ((value & 1) != 0) {
    new_value ^= 0xb400;
  }
  return new_value;
}

Questions

The questions we have are:

* Where do we put the taps?
* How many times can you shift before it starts repeating? (What is the period of the shift register?)

Just based on our knowledge of digital logic, we can figure out that if there are k bits in the shift register, there are 2^k possible states, and the maximum possible period cannot exceed 2^k. We also realize that if the register contains all zeroes, then the next state will also be all zeroes. If our shift register gets into the zero state, it gets stuck. That means that, at best, our shift register could only loop through the other 2^k-1 possible states.

Is it possible to make a shift register that has a period of 2^k-1? The answer is "yes". First of all, we can crib that answer from Wikipedia. Wikipedia says that we can construct an LFSR with maximum period if the "reciprocal characteristic polynomial" is a "primitive polynomial". If you just need to implement an LFSR, you can get a list of primitive polynomials from Xilinx (Application Note XAPP-052) for any length up to 168 bits, and you're done!
Read on if you want to understand what that means and learn how to construct your own primitive polynomials.

Deeper Mathematics

You may ask, "Why can't we redesign the LFSR so that it gets out of the zero state?" You might consider adding NOT gates to your LFSR, or XORing the state with a constant, or some other trick. However, these simple redesigns will only change which state the LFSR gets stuck in; they won't prevent the LFSR from getting stuck.

Brute Force

Of course, for smaller LFSRs, you don't actually need to know about polynomials. You can find out the period of an LFSR easily enough by running it until it repeats.

#include <stdint.h>
#include <stdlib.h>

// Calculate the length of an LFSR's period.
int lfsr_period(uint32_t initial_state, int width, uint32_t taps) {
  // Record when each state is first visited.
  int *states = calloc(1 << width, sizeof(*states));
  if (states == NULL) {
    abort();
  }
  uint32_t state = initial_state;
  for (int t = 1;; t++) {
    if (states[state] != 0) {
      int period = t - states[state];
      free(states);
      return period;
    }
    states[state] = t;
    state = lfsr_advance(state, width, taps);
  }
}

Addition in Binary

Let's take a moment to step back and talk about addition in binary, because we're going to be doing addition a little differently.

You may have seen an explanation of how to add ordinary numbers in binary. For example, in binary, we can add 1011 + 110 = 10001:

 111
  1011
+  110
------
 10001

This is "normal" addition, with carrying. What happens when we don't carry? We get 1011 + 110 = 1121... which is not binary, but that's okay for now.

  1011
+  110
------
  1121

This is just like adding two polynomials together. In fact, mathematicians prefer to call this polynomial addition.

(x^3 + x + 1) + (x^2 + x) = x^3 + x^2 + 2x + 1

What happens when we do addition modulo 2? In modulo 2 arithmetic, 1 + 1 = 0. Addition modulo 2 is the same as the XOR operation in digital logic. Modulo 2, we get 1011 + 110 = 1101.
 1011
+0110
-----
 1101

Or if we think about it as adding polynomials,

(x^3 + x + 1) + (x^2 + x) = x^3 + x^2 + 1

Mathematicians usually call this "polynomial addition" too, but you can see that it doesn't matter if you write out the polynomial as x^2+x or if you just write the coefficients as 110. We're going to be using modulo 2 arithmetic to design LFSRs.

Code

It is easy enough to write a function for polynomial addition modulo 2. Here is the C code:

#include <stdint.h>

// Add two polynomials, modulo 2.
uint32_t mod2_add(uint32_t p, uint32_t q) {
  return p ^ q;
}

This only gives us 32 bits for polynomial coefficients. You may need to use larger types if you want to work with larger polynomials and larger LFSRs.

Deeper Mathematics

Mathematicians have different names for modulo 2 arithmetic depending on what kind of mathematics they are studying. People studying groups call it a "cyclic group," and write it as \mathbb{Z}/2\mathbb{Z}. People studying rings might call it a "quotient ring". People studying finite fields call it a "finite field", and write it as \mathbb{F}_2 or \mathrm{GF}(2)--the "G" stands for "Galois", who was mentioned earlier. I will call modulo 2 arithmetic \mathbb{F}_2, just because it makes explaining LFSRs a little easier.

Mathematicians aren't trying to be difficult by coming up with different names for the same thing--it turns out that finite fields and cyclic groups really are different things. For example, modulo 6 arithmetic is an ordinary cyclic group \mathbb{Z}/6\mathbb{Z}, but it is not a finite field! The cyclic group and the finite field are only the same as each other when the size is a prime number, like 2.

Note that the only numbers that exist in \mathbb{F}_2 are 0 and 1. We've been using larger numbers like 1011... what's up with that?
Well, when mathematicians do this kind of arithmetic, they say that numbers (like 1011) are actually polynomials (like x^3+x+1) where all the coefficients are in \mathbb{F}_2. The polynomials with coefficients in \mathbb{F}_2 are \mathbb{F}_2[x], which is called a "polynomial ring". So, in \mathbb{F}_2 we have the numbers 0 and 1. In \mathbb{F}_2[x], we have polynomials like x^2+1, x^3+x+1, and 0, but we can choose to just write the coefficients, like 101, 1011, and 0.

More Information

* Cyclic Group on Wikipedia, MathWorld
* Finite Field on Wikipedia, MathWorld
* Polynomial Ring on Wikipedia

Subtraction in Binary

What happens when you subtract modulo 2? Do you get negative numbers? Think about what happens when you subtract 1000 - 1011. Is 1011 bigger than 1000? It turns out that "bigger" or "smaller" isn't really something we think about when we are working modulo 2. Subtracting 1000 - 1011 we get 11, and we know that this is right because 1000 + 11 = 1011.

In modulo 2 arithmetic, x - y = x + y. Addition and subtraction are the same operation. Or we can say x = -x, which means the same thing.

Deeper Mathematics

There are a ton of theorems in algebra that can be proven for any field except one built on modulo 2 arithmetic. The fact that x = -x throws a wrench in many of these classic theorems and proofs. Mathematicians say that \mathbb{F}_2 has "characteristic 2", which means that if you add 2 copies of 1, you get 1+1=0. These theorems will start with a phrase like, "Let \mathbb{F} be a field with characteristic other than 2..."

More Information

* Characteristic on Wikipedia, MathWorld
* Mathematics Stack Exchange: What's so special about characteristic 2?

Multiplication in Binary

Let's continue exploring modulo 2 arithmetic by talking about multiplication. Multiplication is just repeated addition, so with our understanding of modulo 2 arithmetic, we can calculate 1011 \cdot 110 = 111010.
   1011
x   110
-------
   0000
  1011
+1011
-------
 111010

Keep in mind that this is special, modulo-2 multiplication, without carry. It's not the same kind of multiplication you get when you write 0b1011 * 0b110 in a program.

If you're a programmer, you'll recognize that multiplying by 10 is the same as shifting to the left by one bit. For example, we can shift 1001101 to the left by one bit:

  1001101
x      10
---------
  0000000
+1001101
---------
 10011010

If we think of it as a polynomial, shifting to the left by one bit is the same as multiplying the polynomial by x (and the polynomial x has coefficients 10).

Code

The straightforward way to multiply two polynomials, mod 2, is with long multiplication.

#include <stdint.h>

// Multiply two polynomials, mod 2.
uint32_t mod2_multiply(uint32_t p, uint32_t q) {
  uint32_t result = 0;
  for (int i = 0; i < 32; i++) {
    if (((q >> i) & 1) != 0) {
      result ^= p << i;
    }
  }
  return result;
}

It turns out that this kind of multiplication is important enough that various CPUs have instructions for it, to make it faster to multiply polynomials modulo 2. The x86 has pclmulqdq and ARM has vmull.p8.

Deeper Mathematics

In \mathbb{F}_2[x], what is (x+y)^2?

Answer

\begin{split} (x + y)^2 &= x^2 + 2xy + y^2 \\ &= x^2 + y^2 \end{split}

This is another reason why various theorems are not true for fields with characteristic 2, even if they are true for all other fields.

Division in Binary

The final piece of the arithmetic puzzle is division. You can use long division to divide numbers modulo 2. Division gives us the quotient and remainder. We're not interested in fractions, so the quotient will always be an integer. For example, we can divide 10001101 / 1011 and get 10110, with a remainder of 111.

          10110
       --------
1011 ) 10001101
       1011
       ----
         1111
         1011
         ----
          1000
          1011
          ----
            111

Note that unlike normal division, we just need to look at the highest bit in each division step.

See also Wikipedia: Polynomial long division for a more general discussion of how to divide polynomials, although the article doesn't cover modulo 2 arithmetic.
Code

To divide modulo 2, we just have to translate the old, familiar long division algorithm to code. This algorithm divides p by q. In each step, we shift q to the left so the highest 1 bit lines up with the remainder, and then subtract (which is XOR).

#include <assert.h>
#include <stdint.h>

// Return the index of the highest bit set, 0-31. Returns -1 if no
// bits are set.
int highest_bit(uint32_t p) {
  if (p == 0) {
    return -1;
  }
  int bit = 0;
  while (p > 1) {
    p >>= 1;
    bit++;
  }
  return bit;
}

// The result of dividing two polynomials, modulo 2.
struct division_result {
  uint32_t quotient;  // p / q
  uint32_t remainder; // p % q
};

// Divide the polynomial p by q, modulo 2.
struct division_result mod2_divide(uint32_t p, uint32_t q) {
  assert(q != 0);
  uint32_t quotient = 0;
  uint32_t remainder = p;
  int remainder_width = highest_bit(p);
  int q_width = highest_bit(q);
  while (remainder_width >= q_width) {
    int shift = remainder_width - q_width;
    remainder ^= q << shift;
    quotient |= 1 << shift;
    remainder_width = highest_bit(remainder);
  }
  return (struct division_result){
      .quotient = quotient,
      .remainder = remainder,
  };
}

If we want to check that the division algorithm worked correctly, there are two things to verify. First, we check that the remainder is smaller than q. If it's not, then we haven't finished dividing. Second, we multiply the quotient by q, add the remainder, and check that we get the original value of p.

// Divide the polynomial p by q, modulo 2, and check that the result
// is correct.
void mod2_divide_check(uint32_t p, uint32_t q) {
  struct division_result r = mod2_divide(p, q);
  // Check that the remainder is strictly lower order than q.
  assert(highest_bit(r.remainder) < highest_bit(q));
  // Check that we get the original value of p back.
  assert(mod2_add(mod2_multiply(r.quotient, q), r.remainder) == p);
}

LFSRs, Now with More Mathematics!

If we take a look at LFSRs again, we can explain what LFSRs are doing in terms of polynomials modulo 2.
We can think of the state of an LFSR with k bits as a polynomial with degree smaller than k. For example, a 16-bit LFSR will have a state that contains coefficients for x^0 up to x^{15}. If the LFSR shifts left, then the state 0x870C is the polynomial P(x):

P(x) = x^{15} + x^{10} + x^9 + x^8 + x^3 + x^2

When we shift the polynomial left one bit, it's the same as multiplying the polynomial by x:

x P(x) = x^{16} + x^{11} + x^{10} + x^9 + x^4 + x^3

We can think of the taps for the LFSR as a degree-16 polynomial, equal to x^{16}, plus x^n for each tap at position n. With our taps at 0, 2, 3, and 5, the taps represent the polynomial Q(x):

Q(x) = x^{16} + x^5 + x^3 + x^2 + 1

This is called the characteristic polynomial of the LFSR. Note that the number of bits in the LFSR is equal to the degree of the characteristic polynomial.

Propagating the high bit to the taps is the same as dividing by Q(x) and taking the remainder. This means we have a simple formula for the next state, P^\prime(x):

\begin{split} P^\prime(x) &= x P(x) \bmod Q(x) \\ &= x^{11} + x^{10} + x^9 + x^5 + x^4 + x^2 + 1 \end{split}

Now, how do we pick the taps? How can we calculate the period of an LFSR?

Deeper Mathematics

Just like we took the integers \mathbb{Z} and constructed the integers mod 2 as the quotient ring \mathbb{Z}/2\mathbb{Z}, we've taken the polynomials \mathbb{F}_2[x] and constructed the quotient ring \mathbb{F}_2[x]/\lang Q(x) \rang, which contains the polynomials in \mathbb{F}_2[x] modulo Q(x).

(At this point, a mathematician might decide to get technical, say that a quotient ring contains equivalence classes, and explain that \lang Q(x) \rang is called a "principal ideal", but that's not a distinction we need to make here.)
More Information

* Quotient Ring on Wikipedia, MathWorld
* Principal Ideal on Wikipedia, MathWorld
* Ideal on Wikipedia, MathWorld

Irreducible Polynomials

An irreducible polynomial is a polynomial that can't be factored into the product of two non-constant polynomials. For example, the polynomial P below is irreducible, but Q is reducible because it can be factored.

P(x) = x^4 + x + 1

\begin{split} Q(x) &= x^4 + x^3 + x^2 + 1 \\ &= (x^3 + x + 1) (x + 1) \end{split}

Any LFSR with maximum period must use an irreducible polynomial for the taps. Why? Think about what happens if you have a polynomial which can be factored,

Q(x) = Q_1(x) Q_2(x)

Suppose that the LFSR starts in state P(x) = Q_1(x). Each time this LFSR advances, if the original state is a multiple of Q_1(x), then the next state will also be a multiple of Q_1(x).

Proof

The next state is:

P^\prime(x) = x P(x) \bmod Q(x)

This means that there is a quotient, S(x),

P^\prime(x) + S(x) Q(x) = x P(x)

By rearranging, we get,

P^\prime(x) = x P(x) - S(x) Q(x)

If P(x) is divisible by Q_1(x), then x P(x) and S(x) Q(x) are both divisible by Q_1(x), and therefore P^\prime(x) must also be divisible by Q_1(x).

If Q_1(x) and Q_2(x) have no common factors, then you are getting the same result as if you were running two LFSRs in parallel--one for Q_1, and one for Q_2. Running two shorter LFSRs will give a shorter period than you would get for a single, maximum-period LFSR with the same number of bits.

Deeper Mathematics

If Q(x) is an irreducible polynomial in \mathbb{F}_2[x], then the quotient ring of polynomials modulo Q(x), called \mathbb{F}_2[x]/\lang Q(x) \rang, is a field. It turns out that all of these fields are isomorphic to each other, and if you are just trying to construct a field, it doesn't matter what you pick for Q(x) as long as it is irreducible. If Q(x) has degree k, then the field is often called \mathbb{F}_{2^k} or \mathrm{GF}(2^k).
Note that 2^k is sometimes just written out as a number and it means the same thing, so \mathbb{F}_{2^8}, \mathbb{F}_{256}, \mathrm{GF}(2^8), and \mathrm{GF}(256) are all the same thing.

It does matter for us what we choose for Q(x), because we care about how the elements of \mathrm{GF}(2^k) are represented.

More Information

* Irreducible Polynomial on Wikipedia, MathWorld

Finding Irreducible Polynomials

Like many problems, the way that you find irreducible polynomials depends on how big the polynomials should be. For smaller polynomials, it's easy enough to try to factor them. If a polynomial with degree n has factors, then at least one of the factors must have degree n/2 or smaller. We can list all of the irreducible polynomials up to degree n/2 and use this list to test whether polynomials of degree n are irreducible. This requires an exponentially increasing amount of CPU power and memory as the polynomials grow larger, but in practice, it works fine if you just need to find a polynomial of degree no more than, say, 40. There are faster algorithms which will let you find much larger irreducible polynomials.

Code

Here is some code for finding an irreducible polynomial of degree 31 or lower. For higher degree polynomials, the code will require some modification... but this approach is too slow for polynomials of high degree anyway.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

// A dynamic list of polynomials.
struct poly_list {
  uint32_t *polys;
  int count;
  int alloc_size;
};

// Append a single polynomial to the end of the list, growing the
// array if necessary.
void poly_append(struct poly_list *list, uint32_t poly) {
  if (list->count >= list->alloc_size) {
    int new_size = list->alloc_size == 0 ?
        8 : list->alloc_size * 2;
    uint32_t *new_array =
        realloc(list->polys, new_size * sizeof(*new_array));
    if (new_array == NULL) {
      abort();
    }
    list->polys = new_array;
    list->alloc_size = new_size;
  }
  list->polys[list->count] = poly;
  list->count++;
}

// Number of lists of irreducible polynomials.
#define LIST_COUNT 16

// List of irreducible polynomials of each degree, up to degree 15.
struct poly_list irreducible[LIST_COUNT];

// Test if a polynomial is irreducible. Requires that the lists of
// irreducible polynomials be initialized up to and including
// degree/2.
bool is_irreducible(uint32_t poly, int degree) {
  // Test polynomials with degree no larger than degree/2. If this
  // polynomial is divisible by none of those polynomials, it is
  // irreducible.
  for (int n = 1; n <= degree / 2; n++) {
    uint32_t *list = irreducible[n].polys;
    int count = irreducible[n].count;
    for (int i = 0; i < count; i++) {
      struct division_result r = mod2_divide(poly, list[i]);
      if (r.remainder == 0) {
        return false;
      }
    }
  }
  return true;
}

// Initialize the list of irreducible polynomials in order to find
// irreducible polynomials of the given degree.
void init_irreducible(int degree) {
  assert(degree / 2 < LIST_COUNT);
  for (int n = 1; n <= degree / 2; n++) {
    uint32_t start = (uint32_t)1 << n;
    uint32_t end = start << 1;
    for (uint32_t poly = start; poly < end; poly++) {
      // Test each degree-n polynomial against the lists built so
      // far; only the lists up to degree n/2 are needed here.
      if (is_irreducible(poly, n)) {
        poly_append(&irreducible[n], poly);
      }
    }
  }
}

// Find an irreducible polynomial with the given degree.
uint32_t find_irreducible(int degree) {
  init_irreducible(degree);
  uint32_t start = (uint32_t)1 << degree;
  uint32_t end = start << 1;
  for (uint32_t poly = start; poly < end; poly++) {
    if (is_irreducible(poly, degree)) {
      return poly;
    }
  }
  // Mathematically impossible! For any degree, it is always
  // possible to find an irreducible polynomial.
  assert(false);
}

Primitive Polynomials

Here's where we get into the real math.
Not all irreducible polynomials give us a maximum-period LFSR, but we know that the characteristic polynomial must be irreducible. It turns out that the polynomial must not only be irreducible, but also a primitive polynomial. There are different but equivalent ways of defining "primitive polynomial", but for our purposes, we can just say that a primitive polynomial is a polynomial that makes an LFSR with maximum possible period.

To simplify the notation, I'll write polynomials as p or q instead of P(x) or Q(x). Let q be the LFSR's characteristic polynomial, with degree k. The possible states for the LFSR are \mathcal{P}_k(\mathbb{F}_2), the set of polynomials with degree less than k. Let S(p) be the set of all states reachable from initial state p.

S(p) = \{ p x^n \bmod q : n \in \mathbb{N} \}

If q is a primitive polynomial, then S(p) will contain 2^k-1 states, and our LFSR has maximum period.

Deeper Mathematics

There are a couple of interesting bits of group and ring theory here.

Primitive polynomials are also defined as polynomials, Q(x), where Q(x) is a factor of x^n+1 for n=2^k-1, but not for any 0 < n < 2^k-1. If you think about the way LFSRs work, this is just equivalent to saying that the LFSR has period 2^k-1! (There is another common way of defining "primitive polynomial" which defines it in terms of something called field extensions.)

The nonzero elements \mathrm{GF}(2^k)\setminus\{0\} form a group, using multiplication as the group operation. S(1) is a subgroup of this group, also written as \lang x \rang, the subgroup generated by x. S(p) is a coset of this subgroup, written as p\lang x \rang.

More Information

* Primitive Polynomial on Wikipedia, MathWorld. MathWorld notes that, "Amazingly, primitive polynomials over GF(2) define a recurrence relation which can be used to obtain a new pseudorandom bit from the n preceding ones." Doesn't that sound familiar? It's talking about LFSRs!
* Subgroup on Wikipedia, MathWorld
* Coset on Wikipedia, MathWorld

Period and Initial State

We can prove that all sets S(p) for p \ne 0 have the same size. In other words, the period of the LFSR does not depend on which state you start in, as long as you don't start in the zero state.

First, we can redefine S(p) in terms of S(1):

\begin{split} S(p) &= \{ p x^n \bmod q : n \in \mathbb{N} \} \\ &= \{ p (x^n \bmod q) \bmod q : n \in \mathbb{N} \} \\ &= \{ p s \bmod q : s \in S(1) \} \end{split}

Written this way, it's obvious that S(p) cannot be larger than S(1). If S(p) is smaller, then it must mean that two different elements, s_1, s_2 \in S(1), become the same element in S(p), which means that p s_1 = p s_2 \mod q. By rearranging, p(s_1 - s_2) = 0 \mod q. Since q is irreducible, it must either be a factor of p or a factor of s_1 - s_2. If it's a factor of p, then p=0. If it's a factor of s_1 - s_2, then s_1 = s_2. Both possibilities contradict the premise, so S(1) and S(p) are the same size, if p \ne 0.

Partitioning All States

We can prove that any two sets of states, S(p_1) and S(p_2), are either equal to each other or disjoint. We can exclude S(0) because it is the only set which contains 0.

Suppose that some element is in both sets, p \in S(p_1) \cap S(p_2), where p \ne 0. From the way these sets are constructed, both S(p_1) and S(p_2) must contain all states reachable from p, S(p). Since all these sets are the same size, they must be the same set.

LFSR Period

This brings us to the final piece of the puzzle, where we can prove what the period of the LFSR is. Consider the set of all sets of reachable states, excluding the zero state.

T = \{ S(p) : p \in \mathcal{P}_k(\mathbb{F}_2), p \ne 0 \}

Each element in T is a set of states that the LFSR can loop through. Every nonzero state must appear in at least one of these sets, because every nonzero p appears in its own S(p).
However, no state appears in more than one of these sets, because we proved that these sets are either disjoint or equal. Since all of the sets in T are the same size, the total number of nonzero states must be equal to the number of sets in T, multiplied by the number of states reachable from a nonzero state.

\lvert T \rvert \cdot \lvert S(1) \rvert = 2^k-1

In other words, the period of an LFSR which uses an irreducible polynomial must be a divisor of 2^k-1. How does this help us?

Deeper Mathematics

This is Lagrange's Group Theorem.

More Information

* Lagrange's Group Theorem on Wikipedia, MathWorld

Finding Primitive Polynomials

We already have a system for finding irreducible polynomials. Once we find an irreducible polynomial, we can test to see if it is primitive by checking that the period is 2^k-1.

Instead of iterating through 2^k-1 possible states, there's a shortcut. Since the period of an LFSR with an irreducible characteristic polynomial must be a divisor of 2^k-1, we only have to check the divisors of 2^k-1. In fact, we don't have to check all the divisors; we just find every prime factor a, and check that the LFSR period is not a divisor of (2^k-1)/a.

For some number n, we can check whether the LFSR's period is a factor of n by calculating x^n \bmod q. If the period is a factor of n, then x^n \bmod q = 1.

Example

For an 8-bit LFSR, we will end up with a maximum period of 255. The prime factors of 255 are 3, 5, and 17. That means that the only periods we need to check for an 8-bit LFSR are 85 (255/3), 51 (255/5), and 15 (255/17).

In math terms, this means taking an irreducible polynomial Q(x) and testing that the following equations are true. If an irreducible polynomial with degree 8 satisfies these equations, it is a primitive polynomial.
\begin{split} x^{85} & \ne 1 \mod Q(x) \\ x^{51} & \ne 1 \mod Q(x) \\ x^{15} & \ne 1 \mod Q(x) \end{split}

Since the LFSR must repeat when advanced 255 times, however, the following formula will hold:

x^{255} = 1 \mod Q(x)

Code

First, we need to find the factors of 2^k-1 which we want to test.

// Maximum number of factors to test. Note that a 32-bit number can
// only have 9 prime factors, at most.
#define MAX_FACTORS 16

// Given an integer n, find all unique factors of the form n/a,
// where a is a prime number and a != n. Returns the number of
// factors.
int factorize(int *factors, int n) {
  int count = 0;
  int rem = n;
  int factor = 2;
  for (; rem >= factor * factor; factor++) {
    if ((rem % factor) == 0) {
      assert(count < MAX_FACTORS);
      factors[count] = n / factor;
      count++;
      while ((rem % factor) == 0) {
        rem /= factor;
      }
    }
  }
  // If rem == n, then n is prime, and we will return 0 factors.
  if (rem > 1 && rem != n) {
    assert(count < MAX_FACTORS);
    factors[count] = n / rem;
    count++;
  }
  return count;
}

Next, we write a function to calculate x^n \bmod q. There's an algorithm for this that runs in \mathrm{O}(\log n) time. It works by repeatedly squaring polynomials. First, we can write a function for multiplying two polynomials, modulo q.

// Multiply two polynomials, mod 2, and return the result, mod q.
uint32_t multiply_modq(uint32_t a, uint32_t b, uint32_t q,
                       int width) {
  uint32_t result = 0;
  // Loop invariant: p = b * x^i mod q.
  uint32_t p = b;
  for (int i = 0; i < width; i++) {
    // Accumulate the result.
    if (((a >> i) & 1) != 0) {
      result ^= p;
    }
    // Calculate p = p * x mod q.
    p <<= 1;
    if (((p >> width) & 1) != 0) {
      p ^= q;
    }
  }
  return result;
}

The exponentiation algorithm works by observing that:

a^b = \begin{cases} 1 &\text{if } b = 0 \\ (a^2)^{b/2} &\text{if $b$ even, $b \ne 0$} \\ a a^{b-1} &\text{if $b$ odd} \end{cases}

This formula still holds with modular arithmetic, polynomials, and LFSRs.

// Compute pow(a, b), mod q.
uint32_t pow_modq(uint32_t a, int b, uint32_t q, int width) {
  if (b == 0) {
    return 1;
  }
  // Loop invariant: The final result is always equal to:
  // accum * power ^ exponent.
  uint32_t accum = 1;
  uint32_t power = a;
  for (int exponent = b; exponent > 1; exponent >>= 1) {
    if ((exponent & 1) != 0) {
      accum = multiply_modq(accum, power, q, width);
    }
    power = multiply_modq(power, power, q, width);
  }
  return multiply_modq(accum, power, q, width);
}

Using these functions, we can find a primitive polynomial with any degree we want, and therefore a maximum-period LFSR of any size.

// Return true if a polynomial is a primitive polynomial. The
// factors must be the result of factorize.
bool is_primitive(uint32_t poly, int degree, int *factors,
                  int factor_count) {
  if (!is_irreducible(poly, degree)) {
    return false;
  }
  for (int i = 0; i < factor_count; i++) {
    // This calculates x^n mod poly, where n is the factor we are
    // testing. "2" is the representation of the polynomial x.
    if (pow_modq(2, factors[i], poly, degree) == 1) {
      return false;
    }
  }
  return true;
}

// Find a primitive polynomial with the given degree.
uint32_t find_primitive(int degree) {
  // List all factors we want to test.
  int factors[MAX_FACTORS];
  int factor_count = factorize(factors, (1 << degree) - 1);
  // Loop over irreducible polynomials.
  init_irreducible(degree);
  uint32_t start = (uint32_t)1 << degree;
  uint32_t end = start << 1;
  for (uint32_t poly = start; poly < end; poly++) {
    if (is_primitive(poly, degree, factors, factor_count)) {
      return poly;
    }
  }
  // Mathematically impossible! For any degree, it is always
  // possible to find a primitive polynomial.
  assert(false);
}

Results

With the code above, we can search for a primitive polynomial, then check that the period of an LFSR with that polynomial has the maximum possible length.
Here are the smallest primitive polynomials of various degrees up to 24:

Size  Polynomial (hex)
  8   0x11d
  9   0x211
 10   0x409
 11   0x805
 12   0x1053
 14   0x402b
 16   0x1002d
 20   0x100009
 24   0x100001b

More Information

* Wikipedia: Linear feedback shift register
* David A. Singer, Secrets of Linear Feedback Shift Registers. This presents a more rigorous explanation of primitive polynomials, which uses somewhat more advanced foundations. It's short and sweet, only ten pages long, but it does assume that you are comfortable with linear algebra, groups, rings, and fields.
* Jason Sachs, Linear Feedback Shift Registers for the Uninitiated, Part XVIII: Primitive Polynomial Generation. This article goes into more depth. Sachs discusses runtime performance and presents different algorithms for constructing primitive polynomials. The article covers eleven different algorithms, explains the math, and provides source code--it's quite thorough!