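1, Once you have the essential matrix, you can use SVD to recover the rotation matrix and the translation vector. The decomposition gives four candidate (R, t) pairs, and the valid one is the pair that places the triangulated points in front of both cameras. The rotation matrix you get this way is correct, but the translation is only a unit vector: its direction is known, its length is not. In other words, the translation is recovered up to scale.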
2, You can never get this scale factor from the images alone; you need additional information. If you are using stereo vision, you know the baseline, so you can use it to recover the scale factor and then get Euclidean coordinates. Otherwise, you can use a known landmark whose world coordinates are already known.
3, So what does this mean for structure and motion? You should realize that the Euclidean transformation is not the only kind of transformation. It is a special case of the metric (similarity) transformation, which is a special case of the affine transformation, and all of these are special cases of the projective transformation. From multi-view images you can only recover the scene up to a metric transformation, and even that requires some constraints on the camera model (for example, known intrinsic parameters).
4, The difference between metric and Euclidean is a scale factor. A metric transformation can change an object's size or a line's length in addition to rotating and translating it, while a Euclidean transformation contains only rotation and translation.
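To make points 1, 2 and 4 concrete, here is a minimal NumPy sketch. It assumes E is already a valid essential matrix, and the function names (decompose_essential, scale_by_baseline, similarity_transform) are made up for this illustration and do not come from any library. It returns only the two rotation candidates and the translation direction; choosing the right combination still needs the points-in-front-of-both-cameras test, which is left out here.

```python
import numpy as np

def decompose_essential(E):
    """Return two candidate rotations and the unit translation direction of E."""
    U, _, Vt = np.linalg.svd(E)
    # E is only defined up to sign, so flip factors to make both proper rotations.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t_unit = U[:, 2]  # direction only: sign and length are not fixed by E
    return R1, R2, t_unit

def scale_by_baseline(t_unit, baseline):
    """Give the unit translation its real length using a known baseline."""
    t_unit = np.asarray(t_unit, dtype=float)
    return baseline * t_unit / np.linalg.norm(t_unit)

def similarity_transform(points, s, R, t):
    """Apply x' = s * R @ x + t to every row of points (s = 1 is Euclidean)."""
    points = np.asarray(points, dtype=float)
    return s * points @ R.T + t
```

With a stereo rig, scale_by_baseline(t_unit, baseline) turns the direction from point 1 into a translation in real units, as in point 2. Calling similarity_transform with s = 1 keeps all lengths, while any other s multiplies every length by s, which is exactly the metric versus Euclidean difference in point 4.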
It's really difficult to understand this geometry stuff. But if I find it difficult, so do other people. Who wins depends on who keeps going and doesn't give up.