This research proposes a distinctive deep network architecture for image captioning and description generation. Specifically, we propose a hierarchically trained deep network that improves the fluency and descriptiveness of the generated image captions. The proposed network consists of an initial region proposal generation step followed by two key stages for image description generation. The initial region proposal generation is based on the Region Proposal Network from Faster R-CNN. It generates regions of interest, which are then used to annotate and classify human and object attributes. The first key stage generates a detailed label description for each region of interest. The second stage uses a Recurrent Neural Network (RNN)-based encoder-decoder structure to translate these regional descriptions into a full image description. In particular, the proposed model labels scenes, objects, and human and object attributes simultaneously, which is achieved through multiple individually trained RNNs. The empirical results indicate that our system performs comparably to existing research and considerably outperforms state-of-the-art methods when evaluated on out-of-domain images from the IAPR TC-12 dataset, even though it is not trained on images from any image captioning dataset. When evaluated with several well-known metrics, the proposed system achieves an improvement of ∼60% at BLEU-1 over existing methods on the IAPR TC-12 dataset. Moreover, compared with related methods, the proposed deep network requires substantially fewer data samples for training, leading to a much-reduced computational cost.
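For readers unfamiliar with the BLEU-1 metric cited above, the following is a minimal sketch of how a sentence-level BLEU-1 score is commonly computed: clipped unigram precision multiplied by a brevity penalty. The function name `bleu1`, the whitespace tokenisation, and the lowercasing are assumptions made for illustration only; this is not the evaluation code used in this work, and published scores are normally computed at corpus level with a standard toolkit.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty.

    Illustrative sketch only; tokenisation is naive whitespace splitting.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    # For each word, the maximum number of times it occurs in any one reference.
    max_ref_counts = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)

    # Clip each candidate word count by its maximum reference count.
    cand_counts = Counter(cand)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in cand_counts.items())
    precision = clipped / len(cand)

    # Brevity penalty uses the reference length closest to the candidate
    # length (ties resolved in favour of the shorter reference).
    c_len = len(cand)
    r_len = min((len(r) for r in refs), key=lambda n: (abs(n - c_len), n))
    bp = 1.0 if c_len > r_len else math.exp(1.0 - r_len / c_len)

    return bp * precision


if __name__ == "__main__":
    score = bleu1("a man rides a horse", ["a man is riding a horse"])
    print(f"BLEU-1 = {score:.3f}")
```

In the usage example, four of the five candidate words are matched by the reference, so the clipped precision is 0.8, and the brevity penalty reduces the score further because the candidate is shorter than the reference.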