In the above image, the division by Z happens implicitly due to the homogeneous coordinate notation.
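To make the implicit division concrete, here is a minimal sketch in plain Python. It assumes a standard pinhole intrinsic matrix K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]; the function name `project_pinhole` and the numeric values are illustrative assumptions, not taken from the figure.

```python
def project_pinhole(point_cam, fx, fy, cx, cy):
    """Project a 3D point given in the camera frame to pixel coordinates."""
    X, Y, Z = point_cam
    # Multiplying K by (X, Y, Z) yields a homogeneous 3-vector (u_h, v_h, w).
    u_h = fx * X + cx * Z
    v_h = fy * Y + cy * Z
    w = Z
    # Converting from homogeneous to pixel coordinates divides by w, which is Z.
    return (u_h / w, v_h / w)


# Hypothetical values: focal length 800 px, principal point (320, 240).
u, v = project_pinhole((2.0, 1.0, 4.0), fx=800.0, fy=800.0, cx=320.0, cy=240.0)
print(u, v)  # 720.0 440.0
```

The matrix product itself never divides by Z; the division only appears when the last component of the homogeneous vector is normalized to 1.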
We will introduce 3 coordinate systems below:
Sometimes the camera coordinate frame and the image coordinate frame are misaligned, as shown below:
If we follow how a 3D point gets left-multiplied by the extrinsic matrix and then by the intrinsic matrix, the coordinate frame intuition we derive is:
(3D Point -> Extrinsic -> Intrinsic) = (World Frame -> Camera Frame -> Image Frame)
t = Translation (last column of the extrinsic matrix)
R = Rotation (first 3x3 part of the extrinsic matrix)
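The chain World Frame -> Camera Frame -> Image Frame can be sketched as follows. This is an illustration under stated assumptions: the extrinsic matrix is taken to be the 3x4 block [R | t] that maps world coordinates to camera coordinates, and the helper names `matvec` and `world_to_pixel`, along with all numeric values, are hypothetical.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def world_to_pixel(point_world, R, t, K):
    """Map a 3D world point to pixel coordinates via extrinsic, then intrinsic."""
    # Build the 3x4 extrinsic matrix [R | t]: R is the first 3x3 part, t the last column.
    extrinsic = [list(R[i]) + [t[i]] for i in range(3)]
    # World frame -> camera frame, using the homogeneous world point (X, Y, Z, 1).
    p_cam = matvec(extrinsic, list(point_world) + [1.0])
    # Camera frame -> image frame (homogeneous), then divide by the last component.
    u_h, v_h, w = matvec(K, p_cam)
    return (u_h / w, v_h / w)


# Hypothetical setup: no rotation, camera frame shifted 2 units along Z.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]
K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]

print(world_to_pixel((2.0, 1.0, 2.0), R, t, K))  # (720.0, 440.0)
```

With these values the world point (2, 1, 2) becomes (2, 1, 4) in the camera frame, and the intrinsic matrix then places it at pixel (720, 440).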