The correlation coefficient of two random variables $\BX$ and $\BY$ indicates what part of their variance is explained by an affine relationship:
\corr\b{\BX,\BY} \ce \frac{\Cov[\BX,\BY]}{\sqrt{\Var[\BX]\Var[\BY]}} = \frac{\E[(\BX-\E[\BX])(\BY-\E[\BY])]}{\sqrt{\E\b{(\BX-\E[\BX])^2}\E\b{(\BY-\E[\BY])^2}}}.
For example, if $\BY \ce a\BX+b$ for some constants $a>0$ and $b$, then $\corr\b{\BX,\BY} = 1$.
If $\corr\b{\BX,\BY} \ne 0$, then their mutual information $\I[\BX;\BY]$ is positive, but variables can be uncorrelated and yet share information.
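As a quick numerical check of the definition, here is a minimal sketch that computes the correlation of the uniform distribution over a list of paired outcomes. The helper name `corr` and the sample values are illustrative assumptions, not something fixed by the text above.

```python
import math

def corr(xs, ys):
    """Correlation of the uniform distribution over the pairs (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, written exactly as in the definition.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)

xs = [0.0, 1.0, 2.0, 5.0]           # arbitrary illustrative outcomes
ys_up = [3 * x + 7 for x in xs]     # affine in x with a = 3 > 0
ys_down = [-2 * x + 1 for x in xs]  # affine in x with a = -2 < 0

print(corr(xs, ys_up))    # 1 up to floating-point rounding
print(corr(xs, ys_down))  # -1 up to floating-point rounding
```

The second call illustrates the mirror case $a<0$, where the same calculation gives correlation $-1$.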
Basic properties
The definition doesn’t change if either $\BX$ or $\BY$ is shifted, so let’s assume that their means are both $0$, in which case
\corr\b{\BX,\BY} = \frac{\E[\BX\BY]}{\sqrt{\E[\BX^2]\E[\BY^2]}}.
By Cauchy–Schwarz for expectations, this lies in the interval $[-1,1]$. In fact, if $\BX$, $\BY$ are discrete and we put their outcomes in two vectors $u,v \in \R^n$ such that $(\BX,\BY)$ is drawn by picking a uniformly random index $i \in [n]$ and returning $(u_i,v_i)$, then correlation corresponds to cosine similarity
\corr\b{\BX,\BY} = \frac{u \cdot v}{\norm{u}\norm{v}}.
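The identity above can be sanity-checked on a small example. The sketch below assumes two hand-picked outcome vectors that each sum to zero (matching the mean-zero assumption); the function names `cosine` and `corr_centered` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity (u . v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def corr_centered(u, v):
    """E[XY] / sqrt(E[X^2] E[Y^2]) when (X, Y) = (u_i, v_i) for a uniform index i."""
    n = len(u)
    e_xy = sum(a * b for a, b in zip(u, v)) / n
    e_xx = sum(a * a for a in u) / n
    e_yy = sum(b * b for b in v) / n
    return e_xy / math.sqrt(e_xx * e_yy)

# Illustrative outcome vectors; both sum to 0, so both variables have mean 0.
u = [-2.0, -1.0, 0.0, 3.0]
v = [1.0, -3.0, 1.0, 1.0]

print(cosine(u, v), corr_centered(u, v))  # the two values agree
```

The factors of $1/n$ from the uniform index cancel between numerator and denominator, which is why the two expressions coincide.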
The correlation doesn’t change if either $\BX$ or $\BY$ is scaled by a positive factor. If we assume that their standard deviations are both $1$, the expression becomes simply
\corr\b{\BX,\BY} = \E[\BX\BY].
Special cases
Normal vectors
A normal random vector $\p{\BX_1, \ldots, \BX_n}$ centered around the origin can be represented as dot products between underlying independent standard normals $\BZ_1, \ldots, \BZ_n$ and some deterministic vectors $v^{(1)}, \ldots, v^{(n)} \in \R^n$. Under this representation, correlation corresponds to cosine similarity:
\corr\b{u \cdot \BZ, v \cdot \BZ}
&= \frac{\E[(u \cdot \BZ)(v \cdot \BZ)]}{\sqrt{\E\b{(u \cdot \BZ)^2}\E\b{(v \cdot \BZ)^2}}}\\
&= \frac{\sum_{i,j}u_iv_j\E[\BZ_i\BZ_j]}{\sqrt{\p{\sum_{i,j}u_iu_j\E[\BZ_i\BZ_j]}\p{\sum_{i,j}v_iv_j\E[\BZ_i\BZ_j]}}}\\
&= \frac{\sum_i u_iv_i}{\sqrt{\p{\sum_iu_i^2}\p{\sum_iv_i^2}}}\\
&= \frac{u \cdot v}{\norm{u}\norm{v}}.
So one way to define two standard normals $\BX,\BY$ with correlation $\rho$ is to take vectors $u = (1,0)$ and $v=\p{\rho, \sqrt{1-\rho^2}}$, giving
\BX&\ce (1,0) \cdot \BZ = \BZ_1\\
\BY&\ce \p{\rho, \sqrt{1-\rho^2}}\cdot \BZ = \rho \BZ_1 + \sqrt{1-\rho^2}\BZ_2.
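This construction is easy to simulate. The following is a rough sketch, assuming an arbitrary target $\rho = 0.6$, a fixed seed, and $100{,}000$ samples; the helper names are made up for illustration.

```python
import math
import random

def correlated_normals(rho, rng):
    """Draw (X, Y) = (Z1, rho*Z1 + sqrt(1 - rho^2)*Z2) from independent standard normals."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    return z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2

def empirical_corr(pairs):
    """Sample correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
    var_y = sum((y - mean_y) ** 2 for _, y in pairs) / n
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is reproducible
rho = 0.6               # illustrative target correlation
pairs = [correlated_normals(rho, rng) for _ in range(100_000)]
estimate = empirical_corr(pairs)
print(estimate)  # should land near 0.6, up to sampling noise
```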
We can define uniform random bits $\BX,\BY$ (e.g. in $\pmo$) with positive correlation $\rho$ as follows:
\BY = \begin{cases}
\BX &\text{with probability $\rho$}\\
\pm 1&\text{otherwise (with equal probability)}
\end{cases}
(for negative correlation, replace $\BX$ by $-\BX$ and $\rho$ by $-\rho$). In particular,
\E\bco{\BY}{\BX=\pm 1} = \pm \rho.
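Because the bit construction is finite, its joint distribution can be enumerated exactly. The sketch below does this with rational arithmetic for an arbitrary choice $\rho = 1/4$; the function name `joint_bits` is hypothetical.

```python
from fractions import Fraction

def joint_bits(rho):
    """Exact joint distribution of (X, Y) over {-1, +1}^2 for the construction above."""
    dist = {}
    for x in (-1, 1):
        for y in (-1, 1):
            # X is uniform; Y copies X with probability rho,
            # and is otherwise a fresh uniform bit.
            p_y_given_x = rho * (1 if y == x else 0) + (1 - rho) * Fraction(1, 2)
            dist[(x, y)] = Fraction(1, 2) * p_y_given_x
    return dist

rho = Fraction(1, 4)  # illustrative value
dist = joint_bits(rho)

e_y = sum(p * y for (x, y), p in dist.items())
e_xy = sum(p * x * y for (x, y), p in dist.items())
e_y_given_x_plus = sum(p * y for (x, y), p in dist.items() if x == 1) / Fraction(1, 2)

# Both bits have mean 0 and variance 1, so the correlation is just E[XY].
print(e_y, e_xy, e_y_given_x_plus)
```

Here `e_y` confirms that $\BY$ is itself uniform, while `e_xy` and `e_y_given_x_plus` both come out equal to $\rho$.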
If $\BX,\BY$ are the indicators of some events $A$ and $B$, then the sign of the correlation indicates whether $A$ makes $B$ more or less likely (and vice versa). By Bayes’ rule, the probability ratios $\frac{\Pr[A \mid B]}{\Pr[A]}$ and $\frac{\Pr[B \mid A]}{\Pr[B]}$ are equal to each other, but they cannot be expressed in terms of $\rho$ only. For example, if $A$ is a $1\%$ event and $B$ happens with #to-write
Fraction of variance explained
#to-write if $\corr\b{\BX,\BY} = \rho$, then $\BX$ explains a $\rho^2$ fraction of $\BY$’s variance, and vice versa
Deviations away from the mean
Under some circumstances, knowing the correlation tells you how deviations from the mean in $\BX$ induce deviations from the mean in $\BY$.
#to-write i.e. $\E\bco{\BY_\perp}{\BX=x} = 0$ in the above
Suppose that $\BX,\BY$ have correlation $\rho$ and the conditional expectation $\E\bco{\BY}{\BX=x}$ is affine in $x$. Then when $\BX$ is $u$ standard deviations away from its mean, $\BY$ will on average be $\rho u$ standard deviations away from its mean.
Since both the lemma and the definition of correlation are invariant to shifts and scalings, let’s assume $\BX$ and $\BY$ have mean $0$ and variance $1$, which means
\rho = \corr\b{\BX,\BY} = \E[\BX\BY].
Suppose that $\E\bco{\BY}{\BX=u} = au+b$ for some constants $a,b$. Then, by the law of total expectation,
0 = \E[\BY] = \E\b{\E\bco{\BY}{\BX}} = a\E[\BX]+ b = b
and
\rho = \E[\BX\BY] = \E\b{\BX\,\E\bco{\BY}{\BX}} = \E[\BX(a\BX+b)] = a\E\b{\BX^2}+b\E[\BX] = a,
which means $\E[\BY \mid \BX = u] = au+b = \rho u$, as desired.
Special cases
Random bits
If $\BX$ can only take two values, then $\E\bco{\BY}{\BX=x}$ is always affine in $x$. In particular, when $\BX \in \pmo$, depending on the value of $\BX$, the expectation of $\BY$ moves by $\pm \rho$ standard deviations.
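To see the two-valued case in action, here is a small sketch on a made-up joint distribution in which $\BX$ takes two values and $\BY$ is otherwise arbitrary. All numbers and helper names are illustrative assumptions; the check is that the conditional mean of $\BY$ matches the prediction "mean plus $\rho u$ standard deviations".

```python
import math

# Hypothetical joint distribution: entries are (x, y, probability).
joint = [
    (0.0, 1.0, 0.1),
    (0.0, 4.0, 0.3),
    (2.0, 0.0, 0.2),
    (2.0, 10.0, 0.4),
]

def expect(f):
    """Expectation of f(X, Y) under the joint distribution."""
    return sum(p * f(x, y) for x, y, p in joint)

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
sd_x = math.sqrt(expect(lambda x, y: (x - mean_x) ** 2))
sd_y = math.sqrt(expect(lambda x, y: (y - mean_y) ** 2))
rho = expect(lambda x, y: (x - mean_x) * (y - mean_y)) / (sd_x * sd_y)

def cond_mean_y(x0):
    """E[Y | X = x0], computed directly from the table."""
    mass = sum(p for x, y, p in joint if x == x0)
    return sum(p * y for x, y, p in joint if x == x0) / mass

for x0 in (0.0, 2.0):
    u = (x0 - mean_x) / sd_x            # deviation of X in standard deviations
    predicted = mean_y + rho * u * sd_y  # the lemma's prediction for E[Y | X = x0]
    print(x0, cond_mean_y(x0), predicted)
```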
Normal vectors
If $\p{\BX,\BY}$ is a normal random vector, then $\E[\BY \mid \BX=x]$ is affine, so this property holds.
To see this, let’s assume $\E[\BX]=\E[\BY]=0$ and again represent them as dot products $\BX \ce u \cdot \BZ$ and $\BY \ce v \cdot \BZ$. Since normals are rotationally symmetric, we can take $u = (a,0)$ and $v=(b,c)$ without loss of generality, which gives
\E[\BY \mid \BX = x] = \E[b\BZ_1 +c\BZ_2 \mid a\BZ_1 = x] = \frac{b}{a}x.
This is false in general
This is not true for all distributions. For example, if $\BX$ is uniform over $[-1,1]$ and $\BY \ce \BX^2$, then $\E[\BX]=0$, $\E[\BY] = 1/3$, and their correlation is
\corr\b{\BX,\BY} = \frac{\E[\BX(\BX^2-1/3)]}{\sqrt{\Var[\BX]\Var[\BX^2]}} = 0,
but
\E\bco{\BY}{\BX=0} = 0 \ne \frac{1}{3} = \E[\BY].
That is, even though the correlation is zero (i.e. $\rho = 0$) and we’re conditioning on $\BX$ being equal to its mean (i.e. $u=0$), $\BY$ does deviate from its mean.
Note that here, $\E[\BY \mid \BX=x] = x^2$, which is very much not affine in $x$.
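The counterexample can also be observed by simulation. This is only a Monte Carlo sketch: the seed, the sample size, and the window $|x| < 0.05$ used to approximate "conditioning on $\BX = 0$" are all arbitrary choices.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the run is reproducible
n = 200_000
xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
ys = [x * x for x in xs]  # Y = X^2

mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
rho_hat = cov / math.sqrt(var_x * var_y)

# Approximate E[Y | X = 0] by averaging Y over samples with X in a small window around 0.
near_zero = [y for x, y in zip(xs, ys) if abs(x) < 0.05]
cond_mean = sum(near_zero) / len(near_zero)

print(rho_hat)    # near 0, up to sampling noise
print(mean_y)     # near 1/3
print(cond_mean)  # near 0, far below the overall mean of Y
```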