In this project, we aim to develop a large-scale pre-trained geospatial model by fine-tuning vision-language models such as CLIP with geospatial data and theories. Specifically, we will use both geospatial visual data (e.g., remote sensing images and street view images) and textual data (e.g., points of interest and social media posts) to adapt general-purpose vision-language models to geospatial and urban contexts. We expect the resulting model to produce highly effective urban representations (embeddings) for a variety of urban analytical tasks, such as urban land use inference and population density estimation.
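To make the fine-tuning objective concrete, the sketch below implements the symmetric contrastive (InfoNCE) loss used in CLIP-style training, here stated in NumPy for illustration. In this setting, a matched pair (e.g., a street view image and the text describing nearby points of interest) is the positive, and all other pairings in the batch are negatives. The function names and the toy embedding shapes are assumptions for the sketch, not part of the project description; an actual implementation would operate on encoder outputs in a deep learning framework.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize embeddings to unit length so the dot product is cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss in the style of CLIP.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair
    (e.g., a remote sensing image and its associated POI/social media text).
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    # (batch, batch) similarity matrix; diagonal entries are the positives.
    logits = img @ txt.T / temperature
    labels = np.arange(len(logits))

    def cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average of image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Minimizing this loss pulls the embeddings of matched image-text pairs together and pushes unmatched pairs apart, which is what lets the fine-tuned model serve as a general embedding backbone for downstream urban tasks.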