Data Science in Python for Climate Data Analysis

gdfa.ugr.es/python

Oceanhackweek, University of Washington (Aug 26-30, 2019)
  • Download-analyze model: analysts first download the desired dataset from a remote server to a local workstation or cluster, and then perform the analysis there

Problem: the size and variety of datasets have grown exponentially, and new data science methodologies have emerged

CF Conventions

  • The Climate and Forecast (CF) metadata conventions are conventions for the description of Earth sciences data, intended to promote the processing and sharing of data files. The CF conventions were introduced in 2003, after several years of development by a collaboration that included staff from U.S. and European climate and weather laboratories
  • In December 2008 the trio of standards, netCDF+CF+OPeNDAP, was adopted by IOOS as a recommended standard for the representation and transport of gridded data. The CF conventions are being considered by the NASA Standards Process Group (SPG) and others as more broadly applicable standards
  • CF originated as a standard for data written in netCDF, but its structure is general and it has been adapted for use with other data formats

OPeNDAP

  • A Web-based architecture focused on enhancing the retrieval of remote, structured data
  • Clients may construct DAP constraint expressions to retrieve specific content (i.e., subsets)
  • OPeNDAP servers offer various types of responses, including XML, JSON, HTML, ASCII, and netCDF
  • Example providers: ECMWF Copernicus (ERA5); ESGF nodes at NASA, NOAA, and others (CMIP5, CORDEX)
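A DAP constraint expression is appended to the dataset URL after a "?" to request only a subset of a variable. The sketch below builds one by hand; the server, variable name, and index ranges are hypothetical:

```python
# Sketch of a DAP constraint expression (hypothetical server and variable).
# Syntax: variable[start:stride:stop] for each dimension, in dimension order.
base_url = "http://example.org/opendap/sst.nc"

# First 12 time steps over an index box in lat/lon:
constraint = "sst[0:1:11][10:1:20][30:1:40]"

# ".ascii" asks the server for an ASCII response; ".dods" would return binary.
subset_url = f"{base_url}.ascii?{constraint}"
print(subset_url)
```

In practice, clients such as xarray build these expressions automatically when you slice a remotely opened dataset.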
In [1]:
from IPython.display import IFrame

IFrame("http://opendap.puertos.es/thredds/catalog.html", width=1200, height=600)
Out[1]:

xarray

  • xarray is inspired by and borrows heavily from pandas
  • It is particularly tailored to working with netCDF files
  • It integrates tightly with dask for parallel computing
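As a minimal sketch of the pandas-style, label-based indexing that xarray borrows, here is a small synthetic DataArray (no file needed; the names and values are illustrative):

```python
import numpy as np
import pandas as pd
import xarray as xr

# A tiny labeled array: 4 daily time steps x 3 latitudes.
times = pd.date_range("2000-01-01", periods=4, freq="D")
da = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "lat"),
    coords={"time": times, "lat": [10.0, 20.0, 30.0]},
    name="tp",
)

# Select by label rather than integer position, as in pandas.
sub = da.sel(time="2000-01-02", lat=20.0)
print(float(sub))  # -> 4.0
```

The same `.sel()` call works unchanged on a dataset opened from disk or over OPeNDAP, which is what the cells below rely on.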
In [2]:
%matplotlib inline
In [3]:
import warnings

warnings.simplefilter("ignore")
In [4]:
%ls data/*.nc
data/significant_height_of_combined_wind_waves_and_swell.nc
data/total_precipitation.nc
In [5]:
import xarray as xr

ds = xr.open_dataset("data/total_precipitation.nc")
In [6]:
ds
Out[6]:
<xarray.Dataset>
Dimensions:    (latitude: 17, longitude: 17, time: 99352)
Coordinates:
  * longitude  (longitude) float32 -24.0 -23.875 -23.75 ... -22.25 -22.125 -22.0
  * latitude   (latitude) float32 17.0 16.875 16.75 16.625 ... 15.25 15.125 15.0
  * time       (time) datetime64[ns] 1980-01-01 ... 2013-12-31T21:00:00
Data variables:
    tp         (time, latitude, longitude) float32 ...
Attributes:
    Conventions:  CF-1.6
    history:      2019-03-09 15:31:29 GMT by grib_to_netcdf-2.10.0: /opt/ecmw...
In [7]:
ds = xr.open_mfdataset("data/*.nc", combine="by_coords")  # , parallel=True
In [8]:
ds
Out[8]:
<xarray.Dataset>
Dimensions:    (latitude: 17, longitude: 17, time: 99352)
Coordinates:
  * longitude  (longitude) float32 -24.0 -23.875 -23.75 ... -22.25 -22.125 -22.0
  * latitude   (latitude) float32 17.0 16.875 16.75 16.625 ... 15.25 15.125 15.0
  * time       (time) datetime64[ns] 1980-01-01 ... 2013-12-31T21:00:00
Data variables:
    swh        (time, latitude, longitude) float32 dask.array<chunksize=(99352, 17, 17), meta=np.ndarray>
    tp         (time, latitude, longitude) float32 dask.array<chunksize=(99352, 17, 17), meta=np.ndarray>
Attributes:
    Conventions:  CF-1.6
    history:      2019-03-09 05:31:25 GMT by grib_to_netcdf-2.10.0: /opt/ecmw...
In [9]:
ds_noaa = xr.open_dataset(
    "http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/noaa.ersst.v5/sst.mnmean.nc"
)
In [10]:
ds_noaa
Out[10]:
<xarray.Dataset>
Dimensions:    (lat: 89, lon: 180, nbnds: 2, time: 1989)
Coordinates:
  * lat        (lat) float32 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
  * lon        (lon) float32 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
  * time       (time) datetime64[ns] 1854-01-01 1854-02-01 ... 2019-09-01
Dimensions without coordinates: nbnds
Data variables:
    time_bnds  (time, nbnds) float64 ...
    sst        (time, lat, lon) float32 ...
Attributes:
    climatology:                     Climatology is based on 1971-2000 SST, X...
    description:                     In situ data: ICOADS2.5 before 2007 and ...
    keywords_vocabulary:             NASA Global Change Master Directory (GCM...
    keywords:                        Earth Science > Oceans > Ocean Temperatu...
    instrument:                      Conventional thermometers
    source_comment:                  SSTs were observed by conventional therm...
    geospatial_lon_min:              -1.0
    geospatial_lon_max:              359.0
    geospatial_laty_max:             89.0
    geospatial_laty_min:             -89.0
    geospatial_lat_max:              89.0
    geospatial_lat_min:              -89.0
    geospatial_lat_units:            degrees_north
    geospatial_lon_units:            degrees_east
    cdm_data_type:                   Grid
    project:                         NOAA Extended Reconstructed Sea Surface ...
    original_publisher_url:          http://www.ncdc.noaa.gov
    References:                      https://www.ncdc.noaa.gov/data-access/ma...
    source:                          In situ data: ICOADS R3.0 before 2015, N...
    title:                           NOAA ERSSTv5 (in situ only)
    history:                         created 07/2017 by PSD data using NCEI's...
    institution:                     This version written at NOAA/ESRL PSD: o...
    citation:                        Huang et al, 2017: Extended Reconstructe...
    platform:                        Ship and Buoy SSTs from ICOADS R3.0 and ...
    standard_name_vocabulary:        CF Standard Name Table (v40, 25 January ...
    processing_level:                NOAA Level 4
    Conventions:                     CF-1.6, ACDD-1.3
    metadata_link:                   :metadata_link = https://doi.org/10.7289...
    creator_name:                    Boyin Huang (original)
    date_created:                    2017-06-30T12:18:00Z (original)
    product_version:                 Version 5
    creator_url_original:            https://www.ncei.noaa.gov
    license:                         No constraints on data access or use
    comment:                         SSTs were observed by conventional therm...
    summary:                         ERSST.v5 is developed based on v4 after ...
    dataset_title:                   NOAA Extended Reconstructed SST V5
    data_modified:                   2019-10-03
    DODS_EXTRA.Unlimited_Dimension:  time

Lazy loading of:

  • on-disk datasets, or
  • remote datasets (e.g., served via OPeNDAP)
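A minimal sketch of what "lazy" means here: open_dataset reads only metadata up front, and variable values are loaded on first access. To stay self-contained, the example writes a tiny temporary netCDF file first (the path and variable name are illustrative):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Write a tiny netCDF file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "tp.nc")
xr.Dataset({"tp": (("time",), np.arange(5.0))}).to_netcdf(path)

ds_lazy = xr.open_dataset(path)   # only metadata is read here
print(ds_lazy["tp"].dims)         # dimensions known without loading values

vals = ds_lazy["tp"].values       # first access triggers the actual read
print(vals.sum())
ds_lazy.close()
```

The same deferred behavior applies to the OPeNDAP-backed dataset opened above, where it matters far more: only the slices you actually use are transferred over the network.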
In [11]:
main_var = "swh"
aux_var = "tp"
In [12]:
ds[main_var]
Out[12]:
<xarray.DataArray 'swh' (time: 99352, latitude: 17, longitude: 17)>
dask.array<open_dataset-0b05cd88f78082bb538e7cda359a3130swh, shape=(99352, 17, 17), dtype=float32, chunksize=(99352, 17, 17), chunktype=numpy.ndarray>
Coordinates:
  * longitude  (longitude) float32 -24.0 -23.875 -23.75 ... -22.25 -22.125 -22.0
  * latitude   (latitude) float32 17.0 16.875 16.75 16.625 ... 15.25 15.125 15.0
  * time       (time) datetime64[ns] 1980-01-01 ... 2013-12-31T21:00:00
Attributes:
    units:      m
    long_name:  Significant height of combined wind waves and swell
In [13]:
time_ini = "2000-01-01"
time_end = "2002-12-31"

longitude = -22
latitude = 16
In [14]:
ds[main_var].sel(time=slice(time_ini, time_end)).plot()
Out[14]:
(array([6.18100e+04, 7.18842e+05, 9.41067e+05, 5.19233e+05, 2.11784e+05,
        6.11640e+04, 1.61720e+04, 3.19200e+03, 4.64000e+02, 2.24000e+02]),
 array([0.76566815, 1.1487132 , 1.5317584 , 1.9148035 , 2.2978487 ,
        2.680894  , 3.0639389 , 3.446984  , 3.8300292 , 4.213074  ,
        4.5961194 ], dtype=float32),
 <a list of 10 Patch objects>)

The purpose of visualization is insight, not pictures

― Ben Shneiderman

In [15]:
import hvplot.xarray
In [16]:
ds[main_var].sel(time=slice(time_ini, time_end)).hvplot()
Out[16]:
In [17]:
ds[main_var].sel(longitude=longitude, latitude=latitude, method="nearest").hvplot()
Out[17]:
In [18]:
ds[main_var].sel(longitude=longitude, latitude=latitude, method="nearest").groupby(
    "time.month"
).mean().hvplot()
Out[18]:
In [19]:
ds[main_var].sel(time=slice(time_ini, time_end)).sel(
    longitude=longitude, latitude=latitude, method="nearest"
).hvplot()
Out[19]:
In [20]:
import cartopy.crs as ccrs
import numpy as np

import matplotlib.pyplot as plt

proj = ccrs.PlateCarree()
In [21]:
figsize = (10, 5)

zoom = 1

lon_min = ds.coords["longitude"].min()
lon_max = ds.coords["longitude"].max()
lat_min = ds.coords["latitude"].min()
lat_max = ds.coords["latitude"].max()

extent = [
    lon_min - zoom * 0.5,
    lon_max + zoom * 0.5,
    lat_min - zoom * 0.5,
    lat_max + zoom * 0.5,
]

resolution = "50m"
In [22]:
def default_map(axes=None, global_map=False, background=True):
    if axes is None:
        fig, ax = plt.subplots(figsize=figsize, subplot_kw={"projection": proj})
    else:
        ax = axes

    if global_map:
        ax.set_global()
    else:
        ax.set_extent(extent)

    ax.coastlines(resolution=resolution)

    if background:
        ax.stock_img()

    return ax
In [23]:
ax = default_map(global_map=True)

plt.scatter(
    ds.coords["longitude"][0],
    ds.coords["latitude"][0],
    c="navy",
    s=100,
    marker="o",
    edgecolors="white",
    alpha=0.9,
)

plt.show()
In [24]:
ax = default_map(background=False)

lons, lats = np.meshgrid(ds.coords["longitude"], ds.coords["latitude"])

plt.scatter(lons, lats, c="blue", s=1)

plt.show()
In [25]:
import geoviews.feature as gf

ds[main_var].isel(time=0).hvplot(
    "longitude", "latitude", crs=proj, cmap="viridis", width=500, height=500
) * gf.coastline.options(scale=resolution, line_width=2)
Out[25]:
In [26]:
df = ds.sel(longitude=longitude, latitude=latitude, method="nearest").to_dataframe()[
    main_var
]
df
Out[26]:
time
1980-01-01 00:00:00    1.518093
1980-01-01 03:00:00    1.573423
1980-01-01 06:00:00    1.610289
1980-01-01 09:00:00    1.621033
1980-01-01 12:00:00    1.621097
                         ...   
2013-12-31 09:00:00    2.923542
2013-12-31 12:00:00    2.997466
2013-12-31 15:00:00    2.816935
2013-12-31 18:00:00    2.610284
2013-12-31 21:00:00    2.433420
Name: swh, Length: 99352, dtype: float32
In [27]:
df.to_csv("data/swh.csv", header=True)
In [28]:
df.to_csv("data/swh.zip", header=True)  # compression="zip" inferred from the extension
In [29]:
%ls -lh data/swh*
-rw-r--r--  1 pedro  staff   2.8M Nov  4 12:04 data/swh.csv
-rw-r--r--  1 pedro  staff   676K Nov  4 12:04 data/swh.zip
In [30]:
%ls -lh data/SIMAR*
-rw-r--r--@ 1 pedro  staff    61M Aug  7  2017 data/SIMAR_1052046
-rw-r--r--  1 pedro  staff   7.2M Nov  4 11:33 data/SIMAR_1052046.zip
In [31]:
import pandas as pd

df_file = pd.read_csv("data/swh.zip", index_col="time", parse_dates=True)
In [32]:
df_file
Out[32]:
swh
time
1980-01-01 00:00:00 1.518093
1980-01-01 03:00:00 1.573423
1980-01-01 06:00:00 1.610289
1980-01-01 09:00:00 1.621033
1980-01-01 12:00:00 1.621097
... ...
2013-12-31 09:00:00 2.923542
2013-12-31 12:00:00 2.997466
2013-12-31 15:00:00 2.816935
2013-12-31 18:00:00 2.610284
2013-12-31 21:00:00 2.433420

99352 rows × 1 columns

In [33]:
df_simar = pd.read_csv(
    "data/SIMAR_1052046",
    delim_whitespace=True,
    skiprows=80,
    parse_dates=[[0, 1, 2, 3]],
    index_col=0,
    header=0,
    na_values=-99.9,
)
In [34]:
df_simar.to_csv("data/SIMAR_1052046.zip", header=True)
In [35]:
import hvplot.pandas

df_file.hvplot()
Out[35]:

Third-party netCDF packages:

  • NCO (netCDF Operators): a package of command-line operators that manipulate generic netCDF data and support some CF conventions
  • CDO (Climate Data Operators): a collection of command-line operators to manipulate and analyze climate and numerical weather prediction data; supports netCDF-3, netCDF-4, GRIB1, GRIB2, and other formats
  • NCL (NCAR Command Language): an interpreted language for scientific data analysis and visualization

Main Tutorials (https://oceanhackweek.github.io/curriculum_2019.html)

  • Access and Clean-up Ocean Observation Data
  • Data Visualization Part I: Basics and Geospatial Visualization
  • Advanced Data Visualization
  • Cloud Computing 101
  • Handle "Big" Larger-than-memory Data
  • Pangeo
  • Machine Learning
  • Git, GitHub and Project Collaboration
  • Reproducible Research

Project: Co-Locators

This project aims to provide a way to co-locate oceanographic data by establishing constraints. The tools developed allow users to provide geospatial bounds within a temporal range and receive an aggregated response of all available data.
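The kind of constraint-based subsetting the project describes can be sketched with the xarray selections used throughout this notebook (synthetic data; variable names and bounds are illustrative, not the project's actual API):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic gridded dataset to filter (names and bounds are illustrative).
times = pd.date_range("2000-01-01", periods=10, freq="D")
ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"), np.random.rand(10, 5, 5))},
    coords={
        "time": times,
        "lat": np.linspace(10, 18, 5),    # ascending, so slice(lo, hi) works
        "lon": np.linspace(-24, -16, 5),
    },
)

# Geospatial bounds within a temporal range:
subset = ds.sel(
    time=slice("2000-01-03", "2000-01-07"),
    lat=slice(12, 16),
    lon=slice(-22, -18),
)
print(subset["sst"].shape)  # -> (5, 3, 3)
```

Aggregating such subsets across several sources is then a matter of repeating the selection per dataset and combining the results.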