Personal summary of Named Tensor Notation.

Named tensor notation gives names to the axes in vectors, matrices and tensors. These names are analogous to units in physics: if you keep track of the units while manipulating physical quantities, you’ll catch basic mistakes, and the computation will make more sense overall.

For example, instead of defining the parity check matrix of a code as $H \in \mathbb{F}^{(n-k)\times n}$, you might define it as $H \in \mathbb{F}^{\mathsf{checks}\times\mathsf{coords}}$. Then, you can

  • access an individual parity check as $H_{\mathsf{checks}(j)} \in \mathbb{F}^{\mathsf{coords}}$,
  • access the coefficients for a particular coordinate as $H_{\mathsf{coords}(i)} \in \mathbb{F}^{\mathsf{checks}}$,
  • access one particular coefficient as $H_{\mathsf{checks}(j),\mathsf{coords}(i)} \in \mathbb{F}$.

Since the axes have names, their order doesn’t matter (see the code sketch after this list):

  • $\mathbb{F}^{\mathsf{checks}\times\mathsf{coords}} = \mathbb{F}^{\mathsf{coords}\times\mathsf{checks}}$,
  • $H_{\mathsf{checks}(j),\mathsf{coords}(i)}$ and $H_{\mathsf{coords}(i),\mathsf{checks}(j)}$ have the same effect.
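This isn’t part of the notation itself, but the same idea is available in code: xarray’s `DataArray` carries named dimensions, and everything is addressed by those names. Below is a rough sketch; the sizes (4 checks, 7 coords) and the use of 0/1 integers instead of a finite field are made-up details.

```python
import numpy as np
import xarray as xr

# A 0/1 matrix standing in for H, with named axes "checks" and "coords".
H = xr.DataArray(np.random.randint(0, 2, size=(4, 7)), dims=("checks", "coords"))

check_j = H.isel(checks=2)          # one parity check: a vector over "coords"
coord_i = H.isel(coords=5)          # one coordinate's coefficients: a vector over "checks"
entry = H.isel(checks=2, coords=5)  # a single coefficient (a scalar)

# Axes are referred to by name, so the order in which you give them doesn't matter.
assert H.isel(checks=2, coords=5).item() == H.isel(coords=5, checks=2).item()
```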

Element-wise operations

You can perform any operation element-wise. For example, given two matrices $A, B \in \mathbb{R}^{\mathsf{height}\times\mathsf{width}}$, their sum $A+B$ is defined as

$$(A+B)_{\mathsf{height}(i),\mathsf{width}(j)} := A_{\mathsf{height}(i),\mathsf{width}(j)} + B_{\mathsf{height}(i),\mathsf{width}(j)}.$$

In general, the result will inherit all the axes present in either operand. For example, given a “column vector” $u \in \mathbb{R}^{\mathsf{height}}$ and a “row vector” $v \in \mathbb{R}^{\mathsf{width}}$, their sum $u+v$ will be the matrix whose elements are

$$(u+v)_{\mathsf{height}(i),\mathsf{width}(j)} := u_{\mathsf{height}(i)} + v_{\mathsf{width}(j)}.$$

To avoid confusion, we can denote element-wise multiplication explicitly by $\odot$. This generalizes the outer product:

$$(u \odot v)_{\mathsf{height}(i),\mathsf{width}(j)} := u_{\mathsf{height}(i)} \, v_{\mathsf{width}(j)}.$$

Element-wise operations are commutative/associative whenever the underlying operations are.
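As a rough code analogue (an xarray sketch with made-up numbers, not part of the notation): xarray also broadcasts element-wise operations by dimension name, so the $u+v$ example above comes out as a height × width matrix.

```python
import numpy as np
import xarray as xr

u = xr.DataArray(np.array([1.0, 2.0, 3.0]), dims=("height",))  # "column vector"
v = xr.DataArray(np.array([10.0, 20.0]), dims=("width",))      # "row vector"

s = u + v  # (u + v)[height(i), width(j)] = u[i] + v[j]
p = u * v  # element-wise product; here it is exactly the outer product

assert dict(s.sizes) == {"height": 3, "width": 2}
assert p.isel(height=2, width=1).item() == 3.0 * 20.0
```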

Diagonal matrices

This sometimes allows you to use vectors instead of diagonal matrices. For example, if you want to scale the rows of a matrix $A \in \mathbb{R}^{\mathsf{height}\times\mathsf{width}}$, instead of multiplying on the left by a diagonal matrix $D$, you can just take the element-wise product $d \odot A$ with a vector $d \in \mathbb{R}^{\mathsf{height}}$. Indeed, we have

$$(d \odot A)_{\mathsf{height}(i)} = \underbrace{d_{\mathsf{height}(i)}}_{\text{scaling factor } (\mathbb{R})} \; \underbrace{A_{\mathsf{height}(i)}}_{\text{row } (\mathbb{R}^{\mathsf{width}})}.$$
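The same row-scaling in code (again an xarray sketch with arbitrary numbers): the vector $d$ broadcasts along the height axis, so no diagonal matrix is needed.

```python
import numpy as np
import xarray as xr

A = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("height", "width"))
d = xr.DataArray(np.array([10.0, 0.5]), dims=("height",))

scaled = d * A  # row i of A gets multiplied by d[i]

assert np.allclose(scaled.isel(height=0).values, 10.0 * A.isel(height=0).values)
assert np.allclose(scaled.isel(height=1).values, 0.5 * A.isel(height=1).values)
```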

Reductions

Reductions “summarize” the information across one of the axes. For example, for a matrix $A \in \mathbb{R}^{\mathsf{height}\times\mathsf{width}}$ (see the sketch after this list),

  • $\sum_{\mathsf{height}} A \in \mathbb{R}^{\mathsf{width}}$ is a “row vector” containing the sum of each column,
  • $\max_{\mathsf{width}} A \in \mathbb{R}^{\mathsf{height}}$ is a “column vector” containing the maxima of each row,
  • $\|A\|_{\mathsf{height}} \in \mathbb{R}^{\mathsf{width}}$ is a “row vector” containing the (2-)norm of each column.
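In code (an xarray sketch), each of these is a reduction over a named dimension; the norm is written out from element-wise operations, since that’s all it is.

```python
import numpy as np
import xarray as xr

A = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("height", "width"))

col_sums = A.sum(dim="height")                   # "row vector" in R^width
row_maxes = A.max(dim="width")                   # "column vector" in R^height
col_norms = np.sqrt((A ** 2).sum(dim="height"))  # 2-norm of each column, in R^width

assert col_sums.dims == ("width",)
assert row_maxes.dims == ("height",)
assert np.allclose(col_norms.values, np.linalg.norm(A.values, axis=0))
```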

Contractions

Many linear algebra operations are contractions: an element-wise multiplication followed by a sum. We denote these using the shorthand

$$A \odot_{\mathsf{axis}} B := \sum_{\mathsf{axis}} A \odot B.$$

Contractions generalize (see the sketch after this list):

  • dot product: for $x, y \in \mathbb{R}^{\mathsf{coords}}$, we have $x \odot_{\mathsf{coords}} y \in \mathbb{R}$;
  • matrix-vector product: for $A \in \mathbb{R}^{\mathsf{output}\times\mathsf{input}}$ and $x \in \mathbb{R}^{\mathsf{input}}$, we have $A \odot_{\mathsf{input}} x \in \mathbb{R}^{\mathsf{output}}$;
  • matrix-matrix product: for $A \in \mathbb{R}^{\mathsf{output}\times\mathsf{hidden}}$ and $B \in \mathbb{R}^{\mathsf{hidden}\times\mathsf{input}}$, we have $A \odot_{\mathsf{hidden}} B \in \mathbb{R}^{\mathsf{output}\times\mathsf{input}}$.
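A contraction can be written in code directly from its definition: element-wise product, then sum over the named axis. Below is a sketch using xarray; the `contract` helper and the sizes are my own illustrative choices, not part of the notation.

```python
import numpy as np
import xarray as xr

def contract(a, b, axis):
    """Element-wise product followed by a sum over the named axis."""
    return (a * b).sum(dim=axis)

x = xr.DataArray(np.arange(3.0), dims=("coords",))
y = xr.DataArray(np.ones(3), dims=("coords",))
A = xr.DataArray(np.ones((2, 4)), dims=("output", "input"))
v = xr.DataArray(np.arange(4.0), dims=("input",))
B = xr.DataArray(np.ones((2, 5)), dims=("output", "hidden"))
C = xr.DataArray(np.ones((5, 3)), dims=("hidden", "input"))

dot = contract(x, y, "coords")     # scalar: 0 + 1 + 2 = 3
matvec = contract(A, v, "input")   # vector over "output"
matmat = contract(B, C, "hidden")  # matrix over "output" x "input"

assert dot.item() == 3.0
assert dict(matmat.sizes) == {"output": 2, "input": 3}
```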

Contractions are great because:

  • they make it clear what you’re summing over,
  • they are commutative,
  • they are (mostly) associative: you can turn $(A \odot_{\mathsf{axis}_1} B) \odot_{\mathsf{axis}_2} C$ into $A \odot_{\mathsf{axis}_1} (B \odot_{\mathsf{axis}_2} C)$ as long as $A$ doesn’t have $\mathsf{axis}_2$ and $C$ doesn’t have $\mathsf{axis}_1$;¹
  • you don’t need to waste time thinking about whether you should take the transpose.

Limitations

Since contractions “type-check” that the axes being summed over are identical (same name, not merely the same length), they cannot represent operations like the trace of a matrix, which pairs up two different axes that just happen to have the same length. Because of this, they are a bit weaker than Einstein summation (e.g. np.einsum), which can sum over arbitrary pairs of axes.
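For instance, the trace pairs up two axes of the same tensor, which einsum can express in one step but a named contraction cannot (the two axes of a named tensor must carry different names):

```python
import numpy as np

A = np.arange(9.0).reshape(3, 3)
print(np.einsum("ii->", A))  # trace of A: 0 + 4 + 8 = 12
```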

Differentiation

Using the method of differentials, named tensor notation makes it easy to compute some derivatives that would have required writing out the matrix products explicitly if done in matrix form.

Let $f : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{T}}$ be a function (where $\mathcal{S}, \mathcal{T}$ are disjoint sets of axes), and let $Y = f(X)$. The main rule is that if you can express the differential $\partial Y$ as

$$\partial Y = A \odot_{\mathcal{S}} \partial X + \text{constant},$$

where “constant” doesn’t depend on $\partial X$, then

$$\frac{\partial Y}{\partial X} = A.$$

Matrix products

For example, consider a depth-two linear network with a single output

$$y = x \odot_{\mathsf{input}} W_{\text{in}} \odot_{\mathsf{hidden}} w_{\text{out}},$$

where $x \in \mathbb{R}^{\mathsf{input}}$, $W_{\text{in}} \in \mathbb{R}^{\mathsf{input}\times\mathsf{hidden}}$ and $w_{\text{out}} \in \mathbb{R}^{\mathsf{hidden}}$, and you want to compute $\frac{\partial y}{\partial W_{\text{in}}}$. Then you can compute

$$
\begin{aligned}
\partial y &= \partial\left(x \odot_{\mathsf{input}} W_{\text{in}} \odot_{\mathsf{hidden}} w_{\text{out}}\right) \\
&= \partial x \odot_{\mathsf{input}} W_{\text{in}} \odot_{\mathsf{hidden}} w_{\text{out}}
 + x \odot_{\mathsf{input}} \partial W_{\text{in}} \odot_{\mathsf{hidden}} w_{\text{out}}
 + x \odot_{\mathsf{input}} W_{\text{in}} \odot_{\mathsf{hidden}} \partial w_{\text{out}} \\
&= \underbrace{(x \odot w_{\text{out}})}_{A} \odot_{\mathsf{input},\mathsf{hidden}} \partial W_{\text{in}}
 + \underbrace{\partial x \odot_{\mathsf{input}} W_{\text{in}} \odot_{\mathsf{hidden}} w_{\text{out}}
 + x \odot_{\mathsf{input}} W_{\text{in}} \odot_{\mathsf{hidden}} \partial w_{\text{out}}}_{\text{constant}},
\end{aligned}
$$

which means $\frac{\partial y}{\partial W_{\text{in}}} = x \odot w_{\text{out}}$.
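A quick numerical sanity check of this derivative, in plain numpy with finite differences (the sizes are arbitrary); in unnamed notation the claimed gradient $x \odot w_{\text{out}}$ is just the outer product of $x$ and $w_{\text{out}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W_in = rng.normal(size=(3, 4))
w_out = rng.normal(size=4)

def y(W):
    # x contracted with W over "input", then with w_out over "hidden".
    return x @ W @ w_out

# Central finite differences, one entry of W_in at a time.
eps = 1e-6
grad_fd = np.empty_like(W_in)
for i in range(W_in.shape[0]):
    for j in range(W_in.shape[1]):
        E = np.zeros_like(W_in)
        E[i, j] = eps
        grad_fd[i, j] = (y(W_in + E) - y(W_in - E)) / (2 * eps)

assert np.allclose(grad_fd, np.outer(x, w_out), atol=1e-6)
```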

More generally, if we have matrices $W^{(i)} \in \mathbb{R}^{\mathsf{axis}_{i-1}\times\mathsf{axis}_i}$ for $i \in [d]$, then the partial derivatives of their product are

$$
\begin{aligned}
\frac{\partial \left(W^{(1)} \odot_{\mathsf{axis}_1} \cdots \odot_{\mathsf{axis}_{d-1}} W^{(d)}\right)}{\partial W^{(1)}}
  &= W^{(2)} \odot_{\mathsf{axis}_2} \cdots \odot_{\mathsf{axis}_{d-1}} W^{(d)}
  \qquad &&\in \mathbb{R}^{\mathsf{axis}_1 \times \mathsf{axis}_d} \\
\frac{\partial \left(W^{(1)} \odot_{\mathsf{axis}_1} \cdots \odot_{\mathsf{axis}_{d-1}} W^{(d)}\right)}{\partial W^{(d)}}
  &= W^{(1)} \odot_{\mathsf{axis}_1} \cdots \odot_{\mathsf{axis}_{d-2}} W^{(d-1)}
  \qquad &&\in \mathbb{R}^{\mathsf{axis}_0 \times \mathsf{axis}_{d-1}} \\
\frac{\partial \left(W^{(1)} \odot_{\mathsf{axis}_1} \cdots \odot_{\mathsf{axis}_{d-1}} W^{(d)}\right)}{\partial W^{(i)}}
  &= \left(W^{(1)} \odot_{\mathsf{axis}_1} \cdots \odot_{\mathsf{axis}_{i-2}} W^{(i-1)}\right)
     \odot \left(W^{(i+1)} \odot_{\mathsf{axis}_{i+1}} \cdots \odot_{\mathsf{axis}_{d-1}} W^{(d)}\right)
  \qquad &&\in \mathbb{R}^{\mathsf{axis}_0 \times \mathsf{axis}_{i-1} \times \mathsf{axis}_{i+1} \times \mathsf{axis}_d}.
\end{aligned}
$$

Note that due to the existence of $\mathsf{axis}_0$ and $\mathsf{axis}_d$, these derivatives have more axes than the matrices we’re differentiating with respect to. If $W^{(1)} \in \mathbb{R}^{\mathsf{axis}_1}$ instead of $\mathbb{R}^{\mathsf{axis}_0 \times \mathsf{axis}_1}$, or if $W^{(d)} \in \mathbb{R}^{\mathsf{axis}_{d-1}}$ instead of $\mathbb{R}^{\mathsf{axis}_{d-1}\times\mathsf{axis}_d}$, then the derivatives stay the same but the type changes.

Renaming

Given a tensor $A \in \mathbb{R}^{\mathsf{axis}_1\times\cdots\times\mathsf{axis}_n}$ we can rename one of its axes using the notation $A_{\mathsf{axis}_i\to\mathsf{newaxis}}$, which now lives in $\mathbb{R}^{\mathsf{axis}_1\times\cdots\times\mathsf{axis}_{i-1}\times\mathsf{newaxis}\times\mathsf{axis}_{i+1}\times\cdots\times\mathsf{axis}_n}$ and is defined by

$$(A_{\mathsf{axis}_i\to\mathsf{newaxis}})_{\mathsf{newaxis}(k)} = A_{\mathsf{axis}_i(k)}.$$

This is equivalent to multiplying by the corresponding identity matrix:

$$A_{\mathsf{axis}_i\to\mathsf{newaxis}} = A \odot_{\mathsf{axis}_i} I_{\mathsf{axis}_i,\mathsf{newaxis}}.$$
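In code (an xarray sketch): renaming is a zero-cost relabeling, and contracting with an identity matrix whose two axes are the old and new names gives the same result.

```python
import numpy as np
import xarray as xr

A = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("height", "width"))

renamed = A.rename({"width": "newwidth"})

# The same renaming as a contraction with an identity matrix over (width, newwidth).
I = xr.DataArray(np.eye(3), dims=("width", "newwidth"))
via_identity = (A * I).sum(dim="width")

assert np.allclose(renamed.transpose("height", "newwidth").values,
                   via_identity.transpose("height", "newwidth").values)
```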

Transpose

Renaming is mostly used to create a temporary independent copy of an axis that is present twice in a computation. In that use, renaming is analogous to taking the transpose, and we’ll² use the shorthand $A^{T(\mathsf{axis})}$ to denote either $A_{\mathsf{axis}\to\mathsf{axis}'}$ or $A_{\mathsf{axis}'\to\mathsf{axis}}$, depending on which of the two axes $A$ currently has.

For example, say you have a matrix $X \in \mathbb{R}^{\mathsf{batch}\times\mathsf{coords}}$ containing $b$ vectors $x^{(1)},\dots,x^{(b)} \in \mathbb{R}^{\mathsf{coords}}$, and you want to express its Gram matrix: the matrix of pairwise inner products $x^{(i)} \odot_{\mathsf{coords}} x^{(j)}$. Then you might try

$$X \odot_{\mathsf{coords}} X.$$

However, the $\mathsf{batch}$ axis gets merged between the two copies of $X$, and all this gets you is a vector in $\mathbb{R}^{\mathsf{batch}}$, containing the squared norms

$$x^{(i)} \odot_{\mathsf{coords}} x^{(i)} = \left\|x^{(i)}\right\|_{\mathsf{coords}}^2$$

of each of the original vectors (i.e. we only got the diagonal of the Gram matrix). If we want the result to keep the $\mathsf{batch}$ axes separate, we need to rename one of them:

$$X \odot_{\mathsf{coords}} X^{T(\mathsf{batch})} \in \mathbb{R}^{\mathsf{batch}\times\mathsf{batch}'}.$$

Similarly, we can express the second-moment matrix as

$$X \odot_{\mathsf{batch}} X^{T(\mathsf{coords})} \in \mathbb{R}^{\mathsf{coords}\times\mathsf{coords}'}.$$

In the usual matrix notation, the Gram matrix would be $XX^T$ and the second-moment matrix would be $X^TX$, which is more compact but less explicit about which axes remain and which axis is summed over.
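The Gram-matrix example in code (an xarray sketch; `batch2` and `coords2` stand in for the primed axes):

```python
import numpy as np
import xarray as xr

X = xr.DataArray(np.random.default_rng(0).normal(size=(5, 3)), dims=("batch", "coords"))

sq_norms = (X * X).sum(dim="coords")                          # only the diagonal, in R^batch
gram = (X * X.rename({"batch": "batch2"})).sum(dim="coords")  # full Gram matrix
second_moment = (X * X.rename({"coords": "coords2"})).sum(dim="batch")

assert np.allclose(np.diag(gram.values), sq_norms.values)
assert np.allclose(gram.transpose("batch", "batch2").values, X.values @ X.values.T)
assert np.allclose(second_moment.transpose("coords", "coords2").values, X.values.T @ X.values)
```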

#to-write more general remarks about “square matrices”

  • and how they’re inconvenient to work with in this notation
  • but on the other hand maybe rightfully so?
  • would be nice if you could allow duplicated axes for symmetric matrices? but you’d need notation for that anyway, to stop the values from just getting multiplied coordinatewise / the axes from getting merged

Powers

#to-write

  • for a matrix $A \in \mathbb{R}^{\mathsf{axis}\times\mathsf{axis}'}$ we can define its power $A^d$ in an unambiguous way
  • but eigenvalues are still annoying to deal with? writing $A \odot_{\mathsf{axis}} v = \lambda\, v^{T(\mathsf{axis})}$ is still very distasteful…
    • but maybe that goes to show how unnatural the whole concept of eigenvalues is? :P
    • and it serves as a good reminder that you can’t just write $(A - \lambda) \odot_{\mathsf{axis}} v = 0$ (since that would mean removing $\lambda$ from every entry, not just the diagonal)

Fixing associativity

Suppose you have a product of the form

$$(A \odot_{\mathsf{axis}_1} B) \odot_{\mathsf{axis}_2} C$$

where $C$ has $\mathsf{axis}_1$, which means that $\mathsf{axis}_1$ first gets summed over when contracting $A$ with $B$, then gets added back to the result when contracting with $C$. We can’t directly apply associativity, because the contraction $B \odot_{\mathsf{axis}_2} C$ would incorrectly merge the two occurrences of $\mathsf{axis}_1$. But we can temporarily rename $\mathsf{axis}_1$ in $C$ using the transpose, then transpose it back at the end to get the same result:

$$(A \odot_{\mathsf{axis}_1} B) \odot_{\mathsf{axis}_2} C = \left((A \odot_{\mathsf{axis}_1} B) \odot_{\mathsf{axis}_2} C^{T(\mathsf{axis}_1)}\right)^{T(\mathsf{axis}_1)} = \left(A \odot_{\mathsf{axis}_1} \left(B \odot_{\mathsf{axis}_2} C^{T(\mathsf{axis}_1)}\right)\right)^{T(\mathsf{axis}_1)}$$

(where the second equality holds as long as $A$ doesn’t have $\mathsf{axis}_2$).
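A numerical check of the rename trick (an xarray sketch reusing the `contract` helper defined earlier from the definition of a contraction; `ax1_tmp` is the temporary name, and the axis names and sizes are made up):

```python
import numpy as np
import xarray as xr

def contract(a, b, axis):
    return (a * b).sum(dim=axis)

rng = np.random.default_rng(0)
A = xr.DataArray(rng.normal(size=(2, 3)), dims=("keep", "ax1"))  # A has ax1 but not ax2
B = xr.DataArray(rng.normal(size=(3, 4)), dims=("ax1", "ax2"))
C = xr.DataArray(rng.normal(size=(3, 4)), dims=("ax1", "ax2"))   # C has ax1 too

lhs = contract(contract(A, B, "ax1"), C, "ax2")

# Rename ax1 in C, associate, then rename back.
C_t = C.rename({"ax1": "ax1_tmp"})
rhs = contract(A, contract(B, C_t, "ax2"), "ax1").rename({"ax1_tmp": "ax1"})

assert np.allclose(lhs.transpose("keep", "ax1").values,
                   rhs.transpose("keep", "ax1").values)
```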

  1. The former problem seems unsolvable, since the result should be summing over $A$’s $\mathsf{axis}_2$, but after associating, $A$ now sits on the outside with $\mathsf{axis}_1$ as the only contraction (and $B$ might not even have $\mathsf{axis}_2$). The latter problem can be solved by renaming $\mathsf{axis}_1$ in $C$ before associating.

  2. This shorthand is not part of the original specification.