Use spread() to convert table2 to table1. What is the meaning of the key and value arguments?
Use gather() to convert table1 to table2. Do you need an extra transformation to make the result fully identical? Can you reuse key and value from the previous result?
Use a bar chart to visualise the data. Which of table1 or table2 is more suitable for plotting?
Use gather() to convert table4a and table4b to table2.
Hint: Use bind_rows() to combine similar tibbles.
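The hint above can be illustrated on toy data. This is a minimal sketch, not the solution for table4a and table4b: the tibbles cases_wide and pop_wide and the column names count and type are hypothetical and chosen only for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy tibbles (not table4a/table4b): one column per year.
cases_wide <- tibble(country = c("A", "B"), `1999` = c(10, 20), `2000` = c(30, 40))
pop_wide   <- tibble(country = c("A", "B"), `1999` = c(100, 200), `2000` = c(300, 400))

# gather() collapses the year columns into a key column and a value column.
# The added "type" column records which measurement each row holds.
cases_long <- cases_wide %>%
  gather(key = "year", value = "count", -country) %>%
  mutate(type = "cases")

pop_long <- pop_wide %>%
  gather(key = "year", value = "count", -country) %>%
  mutate(type = "population")

# bind_rows() stacks tibbles that share the same columns.
long <- bind_rows(cases_long, pop_long) %>%
  arrange(country, year, type)

long
```

Because both intermediate tibbles have identical column names, bind_rows() can stack them without any further alignment.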
Create a scatterplot from the mpg dataset that shows both highway and city fuel economy against engine displacement with two different colors using only one geom_point() call.
Find more exercises in Section 12.3.3 of r4ds.
Convert table3 to table1 and table2.
Convert table2 to table3.
Use separate() to compute departure and arrival hours and minutes in the flights dataset.
Find more exercises in Section 12.4.3 of r4ds.
Do you see a problem in the presidential dataset? Can you tell how this affects the following bar plot without actually running the code?
presidential %>%
mutate(term = end - start) %>%
ggplot() +
geom_bar(aes(name, term))
How are the flights, carriers, and airports datasets connected? Which are primary keys, and which are foreign keys?
Hint: Use count() to support your hypothesis.
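One way to use count() for checking keys is sketched below on toy data. The tibbles lookup and facts and their columns are hypothetical; the same pattern would need to be adapted to the actual datasets.

```r
library(dplyr)

# Hypothetical lookup table; "code" is the candidate primary key.
lookup <- tibble(code = c("AA", "BB", "CC"), label = c("Alpha", "Beta", "Gamma"))

# Hypothetical fact table; "code" is a candidate foreign key here.
facts <- tibble(code = c("AA", "AA", "CC", "DD"), value = 1:4)

# A primary key identifies each row uniquely, so no value may occur twice.
# Zero rows in "duplicates" supports the hypothesis that "code" is a key.
duplicates <- lookup %>%
  count(code) %>%
  filter(n > 1)

# A foreign key should refer to an existing primary key.
# anti_join() keeps the rows of "facts" that have no match in "lookup".
unmatched <- facts %>%
  anti_join(lookup, by = "code")

duplicates
unmatched
```

A non-empty unmatched result does not necessarily disprove a foreign-key relationship; it may simply point to incomplete reference data.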
Plot a heat map of destination by airline for all flights shorter than 300 miles. Use explicit names for the carriers and the destinations. Does the result change if you use a full join? Do you use geom_raster() or geom_bin2d()?
Hint: Use by = c("dest" = "faa").
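The named-vector form of the by argument joins columns that have different names in the two tables. The following sketch uses hypothetical toy tibbles (trips and places), not the flights data, and also shows how a full join can add rows that a left join does not.

```r
library(dplyr)

# Hypothetical toy tables: the key is called "dest" in one and "faa" in the other.
trips  <- tibble(dest = c("X1", "Y2", "X1"), distance = c(100, 250, 100))
places <- tibble(faa = c("X1", "Y2", "Z3"), name = c("Xville", "Ytown", "Zburg"))

# by = c("dest" = "faa") matches trips$dest against places$faa.
left <- trips %>% left_join(places, by = c("dest" = "faa"))

# A full join also keeps rows of "places" without a matching trip,
# filling the columns from "trips" with NA.
full <- trips %>% full_join(places, by = c("dest" = "faa"))

left
full
```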
Find more exercises in Section 13.4.6 of r4ds.
Find the airports that are serviced by at least one flight. Which airports did not have direct connections in 2013?
Find more exercises in Section 13.5.1 of r4ds.
Create plots for time trends of new HIV, new TB, and new malaria cases for all countries of a high-impact region of your choice. Each panel should show the data for one country.
Hint: Use the gfdata package.
Create a single plot that shows time trends of new HIV, new TB, and new malaria cases for all countries of a high-impact region of your choice. Each panel should show the data for one country and one disease.
Hint: You need to reorganize your data. Use rename() to bring the columns into a consistent format, and a combination of spread(), gather(), and separate(..., into = c("disease", "indicator"), extra = "merge") to extract the disease from the rest of the column name.
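The effect of extra = "merge" is easiest to see on a small example. The sketch below assumes hypothetical column names of the form disease_indicator (hiv_new_cases, tb_new_cases) and an explicit sep = "_"; the real column names may need renaming first.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data with one column per disease and indicator.
wide <- tibble(
  year = c(2000, 2001),
  hiv_new_cases = c(5, 6),
  tb_new_cases = c(7, 8)
)

long <- wide %>%
  # Collect all measurement columns into name/value pairs.
  gather(key = "column", value = "value", -year) %>%
  # Split only at the first underscore: with extra = "merge",
  # everything after the first piece stays together in "indicator".
  separate(column, into = c("disease", "indicator"), sep = "_", extra = "merge")

long
```

Without extra = "merge", separate() would drop the surplus pieces with a warning, so the indicator would be truncated to its first word.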
Create a single plot that shows aggregated time trends of new HIV, new TB, and new malaria cases for all regions, composed of the sum of all cases from each country in a region. Each panel should show the data for one region and one disease. At which stage in the data processing do you implement the aggregation?
Create a single plot that shows time trends of mortality or incidence rates for HIV, TB, and malaria cases for all countries of a high-impact region of your choice. Each panel should show the data for one region and one disease.
Hint: You need to create the following new variables first. Use a consistent naming scheme.
hiv_new_infections_number divided by uninfected population in the previous year (t-1) population_excluding_plwha, multiplied by 100
aids_deaths_number divided by total population population_all_ages, multiplied by 100
tb_new_cases_number divided by total population population_all_ages, multiplied by 100,000
tb_deaths_number divided by total population population_all_ages, multiplied by 100,000
malaria_new_cases_number divided by population at malaria risk population_at_malaria_risk, multiplied by 1000
malaria_deaths_number divided by population at malaria risk population_at_malaria_risk, multiplied by 1000
Create a single plot that shows aggregated time trends of mortality or incidence rates for HIV, TB, and malaria cases for all regions. Each panel should show the data for one region and one disease.
Hint: You need to sum the numerators across relevant countries and divide it by sum of the denominators across the same set of countries.
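The reason for summing numerators and denominators separately can be seen on a toy example. The numbers and column names below (deaths, population) are hypothetical; the sketch only contrasts the pooled rate with the unweighted mean of per-country rates.

```r
library(dplyr)

# Hypothetical figures for two countries of very different size in one region.
toy <- tibble(
  region = c("R", "R"),
  country = c("small", "large"),
  deaths = c(10, 100),
  population = c(1000, 1000000)
)

aggregated <- toy %>%
  group_by(region) %>%
  summarize(
    # Pooled rate: sum of numerators divided by sum of denominators.
    rate_pooled = sum(deaths) / sum(population) * 100000,
    # Unweighted mean of country rates, shown only for comparison.
    rate_mean_of_rates = mean(deaths / population * 100000)
  )

aggregated
```

The two columns differ strongly here because the unweighted mean gives the small country the same weight as the large one.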
Create a single plot that shows time trends of mortality or incidence rates for HIV, TB, and malaria cases for all countries in all high-impact regions, compared against the year 2000 baseline. Each panel should show the data for one region.
Hint: Use first() with a grouped mutate().
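A grouped mutate() with first() could look like the following sketch. The tibble toy and the column rate_vs_2000 are hypothetical, and the code assumes that, after sorting by year, the first row of each group is the year 2000.

```r
library(dplyr)

# Hypothetical rates for two countries over three years.
toy <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  year = c(2000, 2001, 2002, 2000, 2001, 2002),
  rate = c(10, 12, 15, 40, 30, 20)
)

relative <- toy %>%
  group_by(country) %>%
  # Sort within each group so that the baseline year comes first.
  arrange(year, .by_group = TRUE) %>%
  # first(rate) is evaluated per group, giving each country's own baseline.
  mutate(rate_vs_2000 = rate / first(rate)) %>%
  ungroup()

relative
```

If some countries lack an observation for 2000, first() would silently pick a different baseline year, so that case needs separate handling.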
people_living_with_hiv_aids_number and the corresponding lower and higher uncertainty ranges) for all countries in all high-impact regions. Each panel should show the data for one country.
Load data from sheets 1 to 9 in the Excel file and bring them into a tidy format.
Hint: The path to the Excel file on the RStudio server is "/data/r-course/courswork_data_tgf.xlsx". Before reading the file, create a link to the "/data/r-course" directory using file.symlink(). Use readxl::read_excel() and look at the documentation of the sheet and range arguments to that function. How do you specify column names?
Use your own data to answer a question about it using the tools you have learned in this course.
Hint: To import, use the “readr” (CSV), “readxl” (Excel), or “haven” (SPSS/SAS/Stata) packages. Use the internet to find out how to import other kinds of data. Use as_tibble() right after importing to get consistent printing.
Alternatively, create a plot of the total number of tuberculosis cases per year for eight countries of your choice. Can you also plot the share (relative to the overall population of the country)?
Copyright © 2017 Kirill Müller. Licensed under CC BY-NC 4.0.