There are a couple of ways to do that, but in general the network has a separate branch for each input, and at some point the two branches are merged.
Both branches need feature maps of the same size at the point where they merge. You can put some convolutional blocks on each input, but somewhere in the middle you will have to resize one branch's feature maps so that both have the same resolution. After that, you just concatenate them along the channel dimension. Here is what it will look like.
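In code, the resize-and-concatenate step could be sketched roughly as follows. This assumes PyTorch and an NCHW tensor layout, neither of which comes from the question; the tensor names and shapes are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical feature maps from the two branches: (batch, channels, height, width).
feat_a = torch.randn(1, 16, 32, 32)
feat_b = torch.randn(1, 8, 64, 64)

# Resize branch B's feature maps to branch A's spatial resolution.
feat_b_resized = F.interpolate(
    feat_b, size=feat_a.shape[-2:], mode="bilinear", align_corners=False
)

# Concatenate along the channel dimension (dim=1 in NCHW layout).
merged = torch.cat([feat_a, feat_b_resized], dim=1)
print(merged.shape)  # torch.Size([1, 24, 32, 32])
```

Bilinear interpolation is only one option here; a strided convolution or pooling layer would also bring the larger map down to the smaller resolution.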

If you already have two feature map blocks with the same resolution, you just concatenate them and pass the result to the next layers. You can also have two different outputs, one for each input, as drawn in the image.
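A minimal sketch of a whole network along these lines, again assuming PyTorch. The class name `TwoInputNet`, the layer sizes, and the choice to resize the second output back to its input's resolution are all illustrative assumptions, not something the question specifies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoInputNet(nn.Module):
    """Hypothetical two-branch network: merge in the middle, one head per input."""

    def __init__(self, ch_a=3, ch_b=1, out_a=2, out_b=1):
        super().__init__()
        # One small convolutional block per input.
        self.branch_a = nn.Sequential(nn.Conv2d(ch_a, 16, 3, padding=1), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Conv2d(ch_b, 16, 3, padding=1), nn.ReLU())
        # Shared layers see the concatenated channels (16 + 16 = 32).
        self.shared = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        # One output head per input.
        self.head_a = nn.Conv2d(32, out_a, 1)
        self.head_b = nn.Conv2d(32, out_b, 1)

    def forward(self, x_a, x_b):
        fa = self.branch_a(x_a)
        fb = self.branch_b(x_b)
        # Resize only if the two feature maps differ in resolution.
        if fa.shape[-2:] != fb.shape[-2:]:
            fb = F.interpolate(fb, size=fa.shape[-2:], mode="bilinear", align_corners=False)
        merged = self.shared(torch.cat([fa, fb], dim=1))
        out_a = self.head_a(merged)
        # Bring the second output back to the second input's resolution.
        out_b = F.interpolate(
            self.head_b(merged), size=x_b.shape[-2:], mode="bilinear", align_corners=False
        )
        return out_a, out_b


model = TwoInputNet()
out_a, out_b = model(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 32, 32))
print(out_a.shape, out_b.shape)
```

When both inputs already share a resolution, the `if` branch is skipped and the feature maps are concatenated directly.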