f(x) = max(0,x)
GELU was introduced after ReLU, and its popularity in the DL literature stems from characteristics that compensate for the drawbacks of ReLU.
Like ReLU, GELU has no upper bound and is bounded below. While ReLU is exactly zero over the whole negative input range, GELU is much smoother in this region. It is differentiable everywhere and keeps gradients (although small) for negative inputs.
This is an advantage over ReLU, which suffers from the ‘dying ReLU’ problem: a significant number of neurons in the network output zero for every input, receive no gradient, and practically stop doing anything.
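The difference in the negative range can be checked numerically. The sketch below is only an illustration: it assumes the standard definition GELU(x) = x * Phi(x), with Phi the standard normal CDF (the formula is not written out in these notes), and the helper names are made up for this example.

```python
import math


def relu(x: float) -> float:
    """ReLU: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU; the value at x = 0 is set to 0 by convention here."""
    return 1.0 if x > 0.0 else 0.0


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    """GELU under the assumed definition x * Phi(x)."""
    return x * normal_cdf(x)


def gelu_grad(x: float) -> float:
    """d/dx [x * Phi(x)] = Phi(x) + x * phi(x), with phi the normal PDF."""
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return normal_cdf(x) + x * pdf


# Compare the two gradients at a few negative and positive inputs.
for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.4f}  gelu'={gelu_grad(x):+.4f}")
```

For every negative input the ReLU gradient is exactly zero, while the GELU gradient is small but nonzero, so a unit that currently sits in the negative range can still receive an update.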
f(x) = x*sigmoid(x)
The graph of Swish is similar to that of GELU.
The paper does not include a comparison with GELU.
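To get a rough feel for how close the two curves are, one can evaluate both on a grid. This is a minimal sketch, not an experiment from either paper: GELU is again assumed to be x * Phi(x), and the grid range [-5, 5] is an arbitrary choice for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, split by sign to avoid overflow in exp."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    """Swish as given in the notes: f(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


def gelu(x: float) -> float:
    """GELU under the assumed definition x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Evaluate both functions on an evenly spaced grid and record the largest gap.
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(swish(x) - gelu(x)) for x in grid)
print(f"largest |swish - gelu| on [-5, 5]: {max_gap:.3f}")
```

The gap stays small relative to the range of the inputs, which matches the visual impression that the two graphs nearly overlap.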
f(x) = x*tanh(softplus(x))
The graph of Mish is similar to those of GELU and Swish.
According to the paper, Mish can handle deeper networks than Swish, and in other respects Mish is usually slightly better than Swish.
Overall, though, the performance of Mish and Swish is nearly identical.
The Mish paper does include GELU in its comparison experiments.
Interestingly, GELU seems to outperform Swish in quite a lot of those experiments.
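The similarity of the three graphs can be seen by tabulating the functions side by side. The sketch below uses the Swish and Mish formulas from these notes and assumes the standard GELU definition x * Phi(x); the numerically stable softplus form is an implementation choice for this example, not something taken from the papers.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, split by sign to avoid overflow in exp."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x: float) -> float:
    """softplus(x) = log(1 + exp(x)), rewritten so exp never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gelu(x: float) -> float:
    """GELU under the assumed definition x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x: float) -> float:
    """Swish: f(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


def mish(x: float) -> float:
    """Mish: f(x) = x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


# Print the three activations at the same inputs for a side-by-side look.
print(f"{'x':>6} {'gelu':>9} {'swish':>9} {'mish':>9}")
for x in (-4.0, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 4.0):
    print(f"{x:>6.1f} {gelu(x):>9.4f} {swish(x):>9.4f} {mish(x):>9.4f}")
```

All three dip slightly below zero for negative inputs and approach the identity for large positive inputs, which is why their graphs look so alike.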