Using R and Python

This libguide covers resources for learning and using R and Python.

License

These materials are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, attributable to Data Services, Johns Hopkins University.

The Python programming language is powerful on its own, but its built-in features do not cover every task a user might want to do. Python users have developed a variety of installable software packages, commonly called "libraries", that expand Python's functionality and handle specialized tasks. This page lists some of the libraries commonly used with Python.
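
As a minimal sketch of what working with libraries looks like in practice, the snippet below uses only Python's standard library to check whether a library is available in the current environment before you try to import it. The helper name `is_installed` and the list of names are illustrative assumptions, not part of any library.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Import names for a few libraries mentioned on this page (illustrative list).
for name in ["requests", "bs4", "pandas", "numpy"]:
    status = "installed" if is_installed(name) else "not installed"
    print(f"{name}: {status}")
```

Note that the name used in an `import` statement can differ from the name used to install a library. For example, Beautiful Soup is installed as `beautifulsoup4` but imported as `bs4`.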

Resources on Python Libraries:

Python Libraries for Data Collection

  • Requests is the go-to library for working with APIs in Python.

  • Beautiful Soup is a Python package for parsing HTML and XML documents. It builds a parse tree from each page, which you can search to extract data, making it useful for web scraping. See the Beautiful Soup documentation for download and usage instructions.

  • Scrapy is a free and open-source web-crawling framework. It was originally designed for web scraping, but it can also be used to extract data using APIs. 

  • Selenium is an open-source browser automation tool. Its Python bindings let you control different web browsers with standard Python commands, which is helpful for pages that only load their content in a browser. See the Selenium documentation for installation instructions.
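
To illustrate the parse-tree idea described for Beautiful Soup above, here is a minimal sketch that parses a small HTML snippet and extracts its links. The HTML content is made up for this example, and the code assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# A small inline HTML document, so the example runs without network access.
html = """
<html><body>
  <h1>Workshops</h1>
  <ul>
    <li><a href="/intro-python">Intro to Python</a></li>
    <li><a href="/intro-r">Intro to R</a></li>
  </ul>
</body></html>
"""

# Build the parse tree with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Search the tree: collect the text and target of every link.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]
print(soup.h1.get_text())
print(links)
```

In a real scraping workflow, the HTML string would typically come from an HTTP response instead, for example the `.text` attribute of the object returned by `requests.get(url)`.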

Python Libraries for Data Cleaning and Manipulation

Statistics and Scientific Computing
  • Pandas is a powerful library for data manipulation and analysis. Its central data structure, the DataFrame, is similar to the data frame found in R. The pandas library is especially useful for exploratory data analysis, statistical analysis, time series data, and preparing data for machine learning.
  • NumPy has support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. 
  • SciPy builds on NumPy to provide additional mathematical functions. It uses NumPy arrays as its basic data structure and comes with modules for commonly used tasks in scientific programming, including linear algebra, integration (calculus), ordinary differential equation solving, and signal processing.
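
The following is a minimal sketch of how these three libraries fit together, assuming pandas, NumPy, and SciPy are installed. The site names and temperature values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import integrate

# pandas: a small table of made-up measurements, summarized by group.
df = pd.DataFrame({
    "site": ["A", "A", "B", "B"],
    "temp_c": [20.0, 22.0, 15.0, 17.0],
})
mean_by_site = df.groupby("site")["temp_c"].mean()
print(mean_by_site)

# NumPy: element-wise arithmetic on a whole array, with no explicit loop.
temps_f = df["temp_c"].to_numpy() * 9 / 5 + 32
print(temps_f)

# SciPy: numerically integrate sin(x) from 0 to pi (the exact answer is 2).
area, abs_error = integrate.quad(np.sin, 0, np.pi)
print(round(area, 6))
```
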
Natural Language Processing
  • NLTK, or Natural Language Toolkit, provides functions and data that can be used for natural language processing.
  • spaCy is an open-source library for advanced natural language processing (NLP). It is designed for production use and can be used to build applications that process and “understand” large volumes of text. See the spaCy documentation for an installation guide.
  • FuzzyWuzzy is used for fuzzy string matching, the process of finding strings that approximately, rather than exactly, match a given pattern.
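
To show the idea behind fuzzy string matching without installing anything, the sketch below uses `difflib` from Python's standard library as a stand-in. It is not FuzzyWuzzy's API, and its similarity scores will not match FuzzyWuzzy's exactly. The helper name `best_match` and the list of names are assumptions made for this example.

```python
import difflib


def best_match(query, candidates, cutoff=0.6):
    """Return the candidate most similar to query, or None if none pass cutoff."""
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


names = [
    "Johns Hopkins University",
    "Johns Hopkins Hospital",
    "Hopkins Marine Station",
]

# A misspelled query still finds the intended entry.
print(best_match("Jhons Hopkins Univeristy", names))

# Similarity score between 0 and 1 for a single pair of strings.
score = difflib.SequenceMatcher(None, "colour", "color").ratio()
print(round(score, 2))
```

Raising `cutoff` makes matching stricter, which reduces false matches at the cost of missing more heavily misspelled entries.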

Python Libraries for Data Visualization

  • Matplotlib is a flexible plotting library for creating interactive 2D and 3D plots that can also be saved as manuscript-quality figures. Its pyplot interface closely resembles MATLAB's plotting commands, which eases the transition for MATLAB users. Many examples, along with the source code to recreate them, are available in the Matplotlib gallery.
  • Seaborn provides a high-level interface for drawing attractive and informative statistical graphics. It is built on top of Matplotlib.
  • Bokeh is an interactive visualization library that renders its output in web browsers and works well with pandas.
  • Plotly is another Python library for creating interactive graphs and charts.
  • Altair is a library for creating static and interactive statistical graphics using a grammar of graphics.
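
As a minimal sketch of the Matplotlib workflow mentioned above, the snippet below draws a simple line plot and saves it as a high-resolution image. It assumes Matplotlib and NumPy are installed. The data, labels, and output file name are made up for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: one period of a sine wave.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x (radians)")
ax.set_ylabel("sin(x)")
ax.set_title("A minimal Matplotlib figure")
ax.legend()

# Save at print resolution; the temporary directory is just for illustration.
out_path = os.path.join(tempfile.mkdtemp(), "sine.png")
fig.savefig(out_path, dpi=300, bbox_inches="tight")
plt.close(fig)
print(out_path)
```

In a Jupyter notebook you would usually skip the `matplotlib.use("Agg")` line so that the figure is displayed inline.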

Python Libraries for Machine Learning/Artificial Intelligence

Geospatial Analysis and Mapping with Python

Core Libraries
  • GDAL/OGR are two components that are distributed together and used for reading, writing, and manipulating geospatial data formats (GDAL for raster, OGR for vector).
  • fiona is used to read and write geospatial data files. 
  • pyshp, or the Python shapefile library, is used to read and write shapefiles. This library can be simpler to work with than gdal if you are only using shapefiles. 
  • shapely is used strictly to analyze and manipulate geometries, and is built on GEOS, the geometry library that also powers PostGIS. As shapely only works with geometries, it is used in conjunction with packages like fiona and gdal to read and write geospatial files.
  • pyproj is used to convert coordinates from one spatial reference system to another. 
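
To give a sense of the geometry operations shapely provides, here is a minimal sketch that assumes the `shapely` package is installed. The coordinates are made up and treated as plain planar units. The variable names `parcel`, `well`, and `zone` are illustrative only.

```python
from shapely.geometry import Point, Polygon, box

# A 4 x 3 rectangle defined by its corner coordinates (arbitrary units).
parcel = Polygon([(0, 0), (4, 0), (4, 3), (0, 3)])
print(parcel.area)

# Point-in-polygon test.
well = Point(1, 1)
print(parcel.contains(well))

# Overlap between the parcel and a second rectangle: box(minx, miny, maxx, maxy).
zone = box(2, 1, 6, 5)
overlap = parcel.intersection(zone)
print(overlap.area)
```

Because shapely has no concept of files or coordinate reference systems, a real workflow would typically read the geometries with a library such as fiona and reproject them with pyproj first.
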
Vector Data & Point Clouds
  • geopandas is used for working with vector geospatial data. It extends the data types used in pandas with geometry columns and makes shapely's geometric operations available on them. Geopandas depends on a file I/O library such as fiona for file access and on matplotlib for plotting.
  • GeoMesa is used for large-scale, vector data processing on distributed computer systems.
  • PDAL, like GDAL and OGR, is a data abstraction library, but for point cloud processing. 
  • LasPy is simpler than PDAL and is used for reading and writing point cloud data in the standard LAS format. 
  • lidar is used for analyzing high-resolution topographic data and digital elevation models (DEMs). 
Raster Data
  • RSGISLib is used for remote sensing workflows, such as conducting change detection analysis or zonal statistics. 
  • Rasterstats is used to summarize raster data within polygon features. 
  • Rasterframes is used for large-scale, raster data processing on distributed computer systems. 
  • Rasterio, originally developed at Mapbox, is built on GDAL and serves as an alternative to GDAL's own Python bindings, with similar functionality but a more 'Pythonic' style.
Interactive Mapping
Leaflet Python Wrappers

Leaflet is the leading open-source JavaScript library for creating interactive maps. The following Python libraries serve as bridges to Leaflet, enabling you to create Leaflet maps in Python.

  • ipyleaflet is a library that enables you to create interactive maps within a Jupyter notebook environment. 
  • folium, like ipyleaflet, is a Python wrapper for Leaflet. Folium provides additional functionality for exporting interactive maps to HTML and other formats.
  • leafmap integrates both ipyleaflet and folium, as well as other geospatial python libraries, to enable users to analyze and visualize geospatial data in a Jupyter notebook environment. 

Google Earth Engine Python Wrapper

Google Earth Engine is a cloud-based platform and geospatial processing service that enables users to visualize and analyze satellite imagery. Google Earth Engine includes a lightweight image data viewer called Earth Engine Explorer (EE Explorer), as well as JavaScript and Python APIs to automate your geospatial workflow.

  • geemap is a Python library for interactive mapping with Google Earth Engine.
Esri Python Libraries

These libraries require an active Esri site license/subscription. All JHU affiliates have access to Esri software and platforms through JHU's Esri education site license agreement. For more information on accessing Esri software and platforms, see our Esri Software Access libguide.

  • ArcPy is Esri's Python geoprocessing library. With an active Esri site license, it can be used for managing, processing, analyzing and visualizing geospatial data in a local environment. ArcPy comes installed with ArcGIS Pro in a miniconda environment and is also available in ArcGIS Desktop.
  • ArcGIS API for Python is Esri's Python API wrapper, and can be used to manage and create web maps and applications in ArcGIS Online and/or ArcGIS Enterprise.