Problem: datasets have grown rapidly in both size and variety, and new data science methodologies have emerged to work with them
from IPython.display import IFrame
IFrame("http://opendap.puertos.es/thredds/catalog.html", width=1200, height=600)
%ls data/*.nc
data/significant_height_of_combined_wind_waves_and_swell.nc data/total_precipitation.nc
import xarray as xr
ds = xr.open_dataset("data/total_precipitation.nc")
ds
<xarray.Dataset>
Dimensions:    (latitude: 17, longitude: 17, time: 99352)
Coordinates:
  * longitude  (longitude) float32 -24.0 -23.875 -23.75 ... -22.25 -22.125 -22.0
  * latitude   (latitude) float32 17.0 16.875 16.75 16.625 ... 15.25 15.125 15.0
  * time       (time) datetime64[ns] 1980-01-01 ... 2013-12-31T21:00:00
Data variables:
    tp         (time, latitude, longitude) float32 ...
Attributes:
    Conventions:  CF-1.6
    history:      2019-03-09 15:31:29 GMT by grib_to_netcdf-2.10.0: /opt/ecmw...
ds = xr.open_mfdataset("data/*.nc", combine="by_coords") # , parallel=True
ds
<xarray.Dataset>
Dimensions:    (latitude: 17, longitude: 17, time: 99352)
Coordinates:
  * longitude  (longitude) float32 -24.0 -23.875 -23.75 ... -22.25 -22.125 -22.0
  * latitude   (latitude) float32 17.0 16.875 16.75 16.625 ... 15.25 15.125 15.0
  * time       (time) datetime64[ns] 1980-01-01 ... 2013-12-31T21:00:00
Data variables:
    swh        (time, latitude, longitude) float32 dask.array<chunksize=(99352, 17, 17), meta=np.ndarray>
    tp         (time, latitude, longitude) float32 dask.array<chunksize=(99352, 17, 17), meta=np.ndarray>
Attributes:
    Conventions:  CF-1.6
    history:      2019-03-09 05:31:25 GMT by grib_to_netcdf-2.10.0: /opt/ecmw...
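The commented-out parallel=True above hints that the multi-file open is Dask-backed. As a minimal variation (the chunk size here is an illustrative choice, not from the original run), the chunking can also be controlled explicitly so downstream reductions stream over blocks:

ds_chunked = xr.open_mfdataset(
    "data/*.nc",
    combine="by_coords",
    parallel=True,           # open the files concurrently with Dask
    chunks={"time": 10000},  # illustrative block size along the time axis
)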
ds_noaa = xr.open_dataset(
"http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/noaa.ersst.v5/sst.mnmean.nc"
)
ds_noaa
<xarray.Dataset>
Dimensions:    (lat: 89, lon: 180, nbnds: 2, time: 1989)
Coordinates:
  * lat        (lat) float32 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
  * lon        (lon) float32 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
  * time       (time) datetime64[ns] 1854-01-01 1854-02-01 ... 2019-09-01
Dimensions without coordinates: nbnds
Data variables:
    time_bnds  (time, nbnds) float64 ...
    sst        (time, lat, lon) float32 ...
Attributes:
    climatology:                    Climatology is based on 1971-2000 SST, X...
    description:                    In situ data: ICOADS2.5 before 2007 and ...
    keywords_vocabulary:            NASA Global Change Master Directory (GCM...
    keywords:                       Earth Science > Oceans > Ocean Temperatu...
    instrument:                     Conventional thermometers
    source_comment:                 SSTs were observed by conventional therm...
    geospatial_lon_min:             -1.0
    geospatial_lon_max:             359.0
    geospatial_laty_max:            89.0
    geospatial_laty_min:            -89.0
    geospatial_lat_max:             89.0
    geospatial_lat_min:             -89.0
    geospatial_lat_units:           degrees_north
    geospatial_lon_units:           degrees_east
    cdm_data_type:                  Grid
    project:                        NOAA Extended Reconstructed Sea Surface ...
    original_publisher_url:         http://www.ncdc.noaa.gov
    References:                     https://www.ncdc.noaa.gov/data-access/ma...
    source:                         In situ data: ICOADS R3.0 before 2015, N...
    title:                          NOAA ERSSTv5 (in situ only)
    history:                        created 07/2017 by PSD data using NCEI's...
    institution:                    This version written at NOAA/ESRL PSD: o...
    citation:                       Huang et al, 2017: Extended Reconstructe...
    platform:                       Ship and Buoy SSTs from ICOADS R3.0 and ...
    standard_name_vocabulary:       CF Standard Name Table (v40, 25 January ...
    processing_level:               NOAA Level 4
    Conventions:                    CF-1.6, ACDD-1.3
    metadata_link:                  :metadata_link = https://doi.org/10.7289...
    creator_name:                   Boyin Huang (original)
    date_created:                   2017-06-30T12:18:00Z (original)
    product_version:                Version 5
    creator_url_original:           https://www.ncei.noaa.gov
    license:                        No constraints on data access or use
    comment:                        SSTs were observed by conventional therm...
    summary:                        ERSST.v5 is developed based on v4 after ...
    dataset_title:                  NOAA Extended Reconstructed SST V5
    data_modified:                  2019-10-03
    DODS_EXTRA.Unlimited_Dimension: time
Lazy loading applies to both:
- on-disk datasets, and
- remote (OPeNDAP) datasets.

Only metadata is read at open time; as the check below shows, values are fetched on demand.
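A minimal check, using the ds opened above:

tp0 = ds["tp"].isel(time=0)  # still a lazy Dask-backed array, no I/O yet
print(type(tp0.data))        # dask.array.core.Array
tp0_loaded = tp0.compute()   # this call triggers the actual read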
main_var = "swh"
aux_var = "tp"
ds[main_var]
<xarray.DataArray 'swh' (time: 99352, latitude: 17, longitude: 17)>
dask.array<open_dataset-0b05cd88f78082bb538e7cda359a3130swh, shape=(99352, 17, 17), dtype=float32, chunksize=(99352, 17, 17), chunktype=numpy.ndarray>
Coordinates:
  * longitude  (longitude) float32 -24.0 -23.875 -23.75 ... -22.25 -22.125 -22.0
  * latitude   (latitude) float32 17.0 16.875 16.75 16.625 ... 15.25 15.125 15.0
  * time       (time) datetime64[ns] 1980-01-01 ... 2013-12-31T21:00:00
Attributes:
    units:      m
    long_name:  Significant height of combined wind waves and swell
time_ini = "2000-01-01"
time_end = "2002-12-31"
longitude = -22
latitude = 16
ds[main_var].sel(time=slice(time_ini, time_end)).plot()
(array([6.18100e+04, 7.18842e+05, 9.41067e+05, 5.19233e+05, 2.11784e+05,
        6.11640e+04, 1.61720e+04, 3.19200e+03, 4.64000e+02, 2.24000e+02]),
 array([0.76566815, 1.1487132 , 1.5317584 , 1.9148035 , 2.2978487 ,
        2.680894  , 3.0639389 , 3.446984  , 3.8300292 , 4.213074  ,
        4.5961194 ], dtype=float32),
 <a list of 10 Patch objects>)
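The tuple above is matplotlib's histogram return value: for a DataArray with more than two dimensions, xarray's .plot() falls back to a histogram of all values. To draw a map instead, reduce to two dimensions first, for example a single time step:

ds[main_var].isel(time=0).plot()  # 2-D slice: xarray draws a pcolormesh map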
The purpose of visualization is insight, not pictures
― Ben Shneiderman
import hvplot.xarray
ds[main_var].sel(time=slice(time_ini, time_end)).hvplot()
ds[main_var].sel(longitude=longitude, latitude=latitude, method="nearest").hvplot()
ds[main_var].sel(longitude=longitude, latitude=latitude, method="nearest").groupby(
"time.month"
).mean().hvplot()
ds[main_var].sel(time=slice(time_ini, time_end)).sel(
longitude=longitude, latitude=latitude, method="nearest"
).hvplot()
import cartopy.crs as ccrs
import numpy as np
import matplotlib.pyplot as plt
proj = ccrs.PlateCarree()
figsize = (10, 5)
zoom = 1
lon_min = ds.coords["longitude"].min()
lon_max = ds.coords["longitude"].max()
lat_min = ds.coords["latitude"].min()
lat_max = ds.coords["latitude"].max()
extent = [
lon_min - zoom * 0.5,
lon_max + zoom * 0.5,
lat_min - zoom * 0.5,
lat_max + zoom * 0.5,
]
resolution = "50m"
def default_map(axes=None, global_map=False, background=True):
    """Create (or reuse) a PlateCarree map axis with coastlines."""
    if axes is None:
        fig, ax = plt.subplots(figsize=figsize, subplot_kw={"projection": proj})
    else:
        ax = axes
    if global_map:
        ax.set_global()
    else:
        ax.set_extent(extent)  # zoom to the dataset region defined above
    ax.coastlines(resolution=resolution)
    if background:
        ax.stock_img()
    return ax
ax = default_map(global_map=True)
plt.scatter(
    ds.coords["longitude"][0],
    ds.coords["latitude"][0],
    c="navy",
    s=100,
    marker="o",
    edgecolors="white",
    alpha=0.9,
    transform=proj,  # interpret the point as longitude/latitude
)
plt.show()
ax = default_map(background=False)
lons, lats = np.meshgrid(ds.coords["longitude"], ds.coords["latitude"])
plt.scatter(lons, lats, c="blue", s=1, transform=proj)
plt.show()
import geoviews.feature as gf
ds[main_var].isel(time=0).hvplot(
"longitude", "latitude", crs=proj, cmap="viridis", width=500, height=500
) * gf.coastline.options(scale=resolution, line_width=2)
df = ds.sel(longitude=longitude, latitude=latitude, method="nearest").to_dataframe()[
main_var
]
df
time
1980-01-01 00:00:00    1.518093
1980-01-01 03:00:00    1.573423
1980-01-01 06:00:00    1.610289
1980-01-01 09:00:00    1.621033
1980-01-01 12:00:00    1.621097
                         ...
2013-12-31 09:00:00    2.923542
2013-12-31 12:00:00    2.997466
2013-12-31 15:00:00    2.816935
2013-12-31 18:00:00    2.610284
2013-12-31 21:00:00    2.433420
Name: swh, Length: 99352, dtype: float32
df.to_csv("data/swh.csv", header=True)
df.to_csv("data/swh.zip", header=True) # compression="zip"
%ls -lh data/swh*
-rw-r--r--  1 pedro  staff   2.8M Nov  4 12:04 data/swh.csv
-rw-r--r--  1 pedro  staff   676K Nov  4 12:04 data/swh.zip
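pandas infers zip compression from the ".zip" extension, which is why the commented-out compression="zip" argument above is redundant. Passing compression explicitly is equivalent and also works when the file name does not reveal the format (the output name below is illustrative):

df.to_csv("data/swh.gz", header=True, compression="gzip")  # hypothetical gzip output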
%ls -lh data/SIMAR*
-rw-r--r--@ 1 pedro  staff    61M Aug  7  2017 data/SIMAR_1052046
-rw-r--r--  1 pedro  staff   7.2M Nov  4 11:33 data/SIMAR_1052046.zip
import pandas as pd
df_file = pd.read_csv("data/swh.zip", index_col="time", parse_dates=True)
df_file
| time | swh |
|---|---|
| 1980-01-01 00:00:00 | 1.518093 |
| 1980-01-01 03:00:00 | 1.573423 |
| 1980-01-01 06:00:00 | 1.610289 |
| 1980-01-01 09:00:00 | 1.621033 |
| 1980-01-01 12:00:00 | 1.621097 |
| ... | ... |
| 2013-12-31 09:00:00 | 2.923542 |
| 2013-12-31 12:00:00 | 2.997466 |
| 2013-12-31 15:00:00 | 2.816935 |
| 2013-12-31 18:00:00 | 2.610284 |
| 2013-12-31 21:00:00 | 2.433420 |

99352 rows × 1 columns
import hvplot.pandas
df_file.hvplot()
This project aims to provide a way to co-locate oceanographic data by establishing constraints. The tools developed allow users to specify geospatial bounds and a temporal range, and to receive an aggregated response built from all available data.
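As a rough illustration of that workflow with the dataset used above (this is just xarray selection on this notebook's ds, not a final API):

bbox = dict(longitude=slice(-24, -22), latitude=slice(17, 15))  # latitude is descending
window = slice("2000-01-01", "2002-12-31")
subset = ds.sel(time=window, **bbox)         # apply the geospatial and temporal constraints
summary = subset.mean(dim="time").compute()  # aggregated response over the window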