Reshaping tutorial
For this section, I will use data from the Global Historical Climatology Network. I got this idea from the tidyr
tutorial. This example assumes you have already worked through the Beginner tutorial
. Let's take a quick peek at the data.
julia> using LightQuery
julia> import Compat: Iterators
julia> using Base.Iterators: flatten
julia> using Dates: Date
julia> cd(joinpath(pkgdir(LightQuery), "test"));
julia> file = open("climate.txt");
julia> line = readline(file)
"ACW00011604194901TMAX 289 X 289 X 283 X 283 X 289 X 289 X 278 X 267 X 272 X 278 X 267 X 278 X 267 X 267 X 278 X 267 X 267 X 272 X 272 X 272 X 278 X 272 X 267 X 267 X 267 X 278 X 272 X 272 X 272 X 272 X 272 X"
julia> close(file)
Oh boy this is some messy data. Let's start by parsing the first chunk, which contains within it the year and the month.
julia> month_variable = (
year = parse(Int, SubString(line, 12, 15)),
month = parse(Int, SubString(line, 16, 17)),
variable = Symbol(SubString(line, 18, 21))
)
(year = 1949, month = 1, variable = :TMAX)
The next chucks each represent a day. Let's parse a day. missing
is represented by -9999
.
julia> function get_day(line, day)
start = 14 + 8 * day
value = parse(Int, line[start:start + 4])
(day = day, value =
if value == -9999
missing
else
value
end
)
end;
julia> get_day(line, 1)
(day = 1, value = 289)
Now, we can get data for every day of the month.
julia> days = Iterators.map((@_ merge(month_variable, get_day(line, _))), 1:31);
julia> first(days)
(year = 1949, month = 1, variable = :TMAX, day = 1, value = 289)
Use Iterators.filter
to remove missing data;
julia> days = Iterators.filter((@_ _.value !== missing), days);
julia> first(days)
(year = 1949, month = 1, variable = :TMAX, day = 1, value = 289)
Use transform
a true data and remove
the old fields.
julia> get_date(day) =
@> day |>
transform(_, date = Date(_.year, _.month, _.day)) |>
remove(_, name"year", name"month", name"day");
julia> get_date(first(days)) == (
variable = :TMAX,
value = 289,
date = Date("1949-01-01")
)
true
We can combine these steps to process a whole month.
julia> function get_month_variable(line)
month_variable = (
year = parse(Int, SubString(line, 12, 15)),
month = parse(Int, SubString(line, 16, 17)),
variable = Symbol(SubString(line, 18, 21))
)
@> Iterators.map((@_ merge(month_variable, get_day(line, _))), 1:31) |>
Iterators.filter((@_ _.value !== missing), _) |>
Iterators.map(get_date, _)
end;
julia> first(get_month_variable(line)) == (
variable = :TMAX,
value = 289,
date = Date("1949-01-01")
)
true
Each line will return several days. Use flatten
to unnest the data.
julia> climate_data =
@> eachline("climate.txt") |>
Iterators.map(get_month_variable, _) |>
flatten |>
make_columns |>
Rows(; _...);
julia> Peek(climate_data)
Showing 4 of 1231 rows
| variable | value | date |
| --------:| -----:| ----------:|
| :TMAX | 289 | 1949-01-01 |
| :TMAX | 289 | 1949-01-02 |
| :TMAX | 283 | 1949-01-03 |
| :TMAX | 283 | 1949-01-04 |
Let's sort and group the data by date. This will allow us to combine various measures into one row for each date.
julia> by_date =
@> climate_data |>
order(_, name"date") |>
Group(By(_, name"date"));
julia> day_variables = first(by_date);
julia> key(day_variables)
1949-01-01
julia> value(day_variables) |> Peek
Showing 4 of 5 rows
| variable | value | date |
| --------:| -----:| ----------:|
| :TMAX | 289 | 1949-01-01 |
| :TMIN | 217 | 1949-01-01 |
| :PRCP | 0 | 1949-01-01 |
| :SNOW | 0 | 1949-01-01 |
Use transform
to add all the new variables to the day key.
julia> spread_variables(day_variables) = transform((date = key(day_variables),);
Iterators.map(
(@_ _.variable => _.value),
value(day_variables)
)...
);
julia> spread_variables(day_variables) == (
date = Date("1949-01-01"),
TMAX = 289,
TMIN = 217,
PRCP = 0,
SNOW = 0,
SNWD = 0
)
true
Finally, we can run this reshaping over each day in the data using Iterators.map
.
julia> @> by_date |>
Iterators.map(spread_variables, _) |>
Peek
Showing at most 4 rows
| WT16 | date | TMAX | TMIN | PRCP | SNOW | SNWD |
| -------:| ----------:| ----:| ----:| ----:| ----:| ----:|
| missing | 1949-01-01 | 289 | 217 | 0 | 0 | 0 |
| 1 | 1949-01-02 | 289 | 228 | 30 | 0 | 0 |
| missing | 1949-01-03 | 283 | 222 | 0 | 0 | 0 |
| 1 | 1949-01-04 | 283 | 233 | 0 | 0 | 0 |