This workshop was presented as part of a webinar; see https://www.cbioportal.org/tutorials. It might be useful to look at the slides before following the steps here.
For this workshop we use R with the cBioPortalData and AnVIL packages. We will also be using RStudio.
The easiest way to do the workshop is to run RStudio on mybinder.org by clicking this link:
This spins up a machine for you on a remote server that you can control through your browser. This Binder environment comes pre-installed with the necessary packages. In RStudio open workshop.Rmd. If you open this document in RStudio it should show a play icon at the top right of the code block below. Click on it and it should run without errors.
library(cBioPortalData)
library(AnVIL)
After the workshop you might want to install RStudio and cBioPortalData locally on your own machine.
If you already have R>=4.0.0 and RStudio installed and are familiar with installing packages you can install cBioPortalData with:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
if (!requireNamespace("cBioPortalData", quietly = TRUE))
    BiocManager::install("cBioPortalData")
library(cBioPortalData)
Alternatively you can use Bioconda. This is probably the fastest way to install it if you don’t have R and Bioconductor installed yet. After following the instructions on the Bioconda website you can create an environment with cBioPortalData like this:
conda create -n cbioportaldata bioconductor-cbioportaldata
conda activate cbioportaldata
To use cBioPortalData in RStudio you first need to download RStudio from their website. Once installed you can open RStudio from the command line inside the Bioconda environment, e.g. on a Mac it works like this:
conda activate cbioportaldata
open -na /Applications/RStudio.app
As a rule of thumb we recommend creating one environment per project or analysis you are working on.
Bioconductor also offers Docker images and Amazon AMIs. cBioPortalData has been part of Bioconductor since version 3.11.
Now that the tedious part of installing software is over we can start to use the API.
Initialize the API with the following commands:
cbio <- cBioPortal()
cbio
## service: cBioPortal
## tags(); use cbioportal$<tab completion>:
## # A tibble: 70 x 3
## tag operation summary
## <chr> <chr> <chr>
## 1 A. Cancer Types getAllCancerTypesUsingGET Get all cancer types
## 2 A. Cancer Types getCancerTypeUsingGET Get a cancer type
## 3 B. Studies fetchStudiesUsingPOST Fetch studies by IDs
## 4 B. Studies getAllStudiesUsingGET Get all studies
## 5 B. Studies getStudyUsingGET Get a study
## 6 B. Studies getTagsForMultipleStudiesUsingPOST Get the study tags by IDs
## 7 B. Studies getTagsUsingGET Get the tags of a study
## 8 C. Patients fetchPatientsUsingPOST fetchPatients
## 9 C. Patients getAllPatientsInStudyUsingGET Get all patients in a stu…
## 10 C. Patients getAllPatientsUsingGET Get all patients
## # … with 60 more rows
## tag values:
## A. Cancer Types, B. Studies, C. Patients, D. Samples, E. Sample
## Lists, F. Clinical Attributes, G. Clinical Data, H. Clinical Events,
## I. Molecular Data, J. Molecular Profiles, K. Mutations, L. Discrete
## Copy Number Alterations, M. Copy Number Segments, N. Genes, O. Gene
## Panels, P. Generic Assay, Q. Structural Variants, R. Reference Genome
## Genes, S. Resource Definitions, T. Resource Data
## schemas():
## AlleleSpecificCopyNumber, CancerStudy, CancerStudyTags,
## ClinicalAttribute, ClinicalAttributeCount
## # ... with 51 more elements
This gives an API object in R that allows, among other things, access to all the endpoints listed here: https://www.cbioportal.org/api. The object works with tab completion, so if you type cbio$get and press the Tab key on your keyboard it will suggest names of various endpoints starting with get:
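To get an idea of how the client works, let’s try to answer a few questions using the API object we just created:
1. How many studies are there in cBioPortal?
2. How many cancer types do they span?
3. How many samples are there in total?
4. Which study has the largest number of samples?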
There are several ways to get these answers. One way is to determine the name of the endpoint that is likely to return this data.
You can list the endpoints by using the Tab completion option shown before or by searching through the endpoints using the cBioPortalData function searchOps:
searchOps(cbio, "studies")
## [1] "getAllStudiesUsingGET" "fetchStudiesUsingPOST"
## [3] "getTagsForMultipleStudiesUsingPOST"
The getAllStudiesUsingGET endpoint seems quite likely to give the response we need to determine the total number of studies in cBioPortal.
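To make the idea of searching endpoint names by keyword concrete, here is a minimal sketch in base R. It is not how searchOps is implemented; findOperations is a hypothetical helper, and the operation names are simply copied from the API object printout above.

```r
# Operation names copied from the printed API object above.
operations <- c(
  "getAllCancerTypesUsingGET",
  "getCancerTypeUsingGET",
  "fetchStudiesUsingPOST",
  "getAllStudiesUsingGET",
  "getStudyUsingGET",
  "getTagsForMultipleStudiesUsingPOST",
  "fetchPatientsUsingPOST"
)

# Hypothetical helper: case-insensitive keyword search over the names.
findOperations <- function(ops, keyword) {
  grep(keyword, ops, ignore.case = TRUE, value = TRUE)
}

matches <- findOperations(operations, "studies")
print(matches)
```

Note that a keyword search like this only matches the literal text, so searching for "studies" does not return getStudyUsingGET.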
Another way to find the relevant endpoint is to browse the API reference, which lists all the endpoints: https://www.cbioportal.org/api/swagger-ui.html. In the Studies collection of endpoints it shows there is an endpoint to get all studies:
You could click on “Try it Out”, press the “Execute” button and notice that the response does indeed have all studies.
Lastly, another approach for finding an endpoint is to browse the cbioportal.org website. You can in this case take advantage of the fact that the homepage (www.cbioportal.org) lists how many studies there are:
Note that the total number of studies may differ if you run this workshop at a later date.
To get the same information programmatically, let’s try to figure out what endpoint the homepage is using. Go to www.cbioportal.org and open the “Developer Tools” in your browser (View > Developer Tools). Click on the “Network” tab and filter the requests by api. Refresh the homepage. If you look at the /api/studies endpoint you should see something like this:
There are two other endpoints being used, but if you look at the response of each you’ll notice that only the /api/studies endpoint lists the studies. There are 284 elements in the response at time of writing, so we now know that the number of elements in this response corresponds to the number of studies.
Now that we are masters of endpoint searching, let’s try to actually get the data into R using the cBioPortalData API object. For example:
resp <- cbio$getAllStudiesUsingGET()
resp
## Response [http://www.cbioportal.org/api/studies]
## Date: 2020-05-28 12:27
## Status: 200
## Content-Type: application/json;charset=UTF-8
## Size: 148 kB
This gives an object with information about the HTTP response. To parse the response into a more convenient object for analysis use the httr::content function:
parsedResponse <- httr::content(resp)
cat("Number of elements in the response:", length(parsedResponse))
## Number of elements in the response: 285
Since we know in this case that each element represents a study, we can answer question 1:
cat("Answer 1: There are", length(parsedResponse), "studies in cBioPortal")
## Answer 1: There are 285 studies in cBioPortal
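The parsed response is a plain R list with one element per study. As a rough illustration of what semi-manual parsing involves, the sketch below builds a small stand-in list and flattens it into a data frame. The study IDs and counts are made up, and the field names (studyId, cancerTypeId, allSampleCount) are assumed to match what the API returns for each study.

```r
# Toy stand-in for httr::content(resp): one named list per study.
# All values here are made up for illustration.
parsedToy <- list(
  list(studyId = "study_a", cancerTypeId = "brca", allSampleCount = 120L),
  list(studyId = "study_b", cancerTypeId = "luad", allSampleCount = 80L),
  list(studyId = "study_c", cancerTypeId = "brca", allSampleCount = 45L)
)

# Pull one field at a time out of every element.
studyIds <- vapply(parsedToy, function(s) s$studyId, character(1))
sampleCounts <- vapply(parsedToy, function(s) s$allSampleCount, integer(1))

# Combine the extracted fields into a table.
toyStudies <- data.frame(
  studyId = studyIds,
  allSampleCount = sampleCounts,
  stringsAsFactors = FALSE
)

cat("Number of elements:", length(parsedToy), "\n")
print(toyStudies)
```

Using vapply with an explicit type makes the extraction fail loudly if a field is missing or has an unexpected type, which is useful when a response does not look the way you assumed.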
As you can see, this took quite a few steps. Parsing responses semi-manually becomes tedious, which is why cBioPortalData has a range of convenience functions that do this for you. For example, instead of having to figure out the endpoint for studies one can simply use the function getStudies:
# First time you run this command it might ask
# to set up a cache folder like:
#
# Create cBioPortalData cache at
# /home/jovyan/.cache/cBioPortalData? [y/n]:
#
# If that's the case, open the Console in RStudio and press y
studies <- getStudies(cbio)
studies
## # A tibble: 285 x 13
## name shortName description publicStudy pmid citation groups status
## <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <int>
## 1 Chol… Cholangi… Exome sequ… TRUE 2418… Chan-on… "PUBL… 0
## 2 Cuta… CTCL (Co… Whole-Exom… TRUE 2655… Da Silv… "" 0
## 3 Esop… ESCC (UC… Whole exom… TRUE 2468… Lin et … "PUBL… 0
## 4 Oral… Head & n… Comprehens… TRUE 2361… Pickeri… "" 0
## 5 Hepa… HCC (Ins… Whole-exom… TRUE 2582… Schulze… "PUBL… 0
## 6 Uvea… UM (QIMR) Whole-geno… TRUE 2668… Johanss… "PUBL… 0
## 7 Neur… NBL (AMC) Whole geno… TRUE 2236… Molenaa… "PUBL… 0
## 8 Naso… NPC (Sin… Whole exom… TRUE 2495… Lin et … "PUBL… 0
## 9 Thym… TET (NCI) Whole exom… TRUE 2497… Petrini… "PUBL… 0
## 10 Neur… NBL (Col… Whole-geno… TRUE 2646… Peifer … "" 0
## # … with 275 more rows, and 5 more variables: importDate <chr>,
## # allSampleCount <int>, studyId <chr>, cancerTypeId <chr>,
## # referenceGenome <chr>
The getStudies function returns a special kind of table (a tibble). It allows for easy transformations to help answer the other questions more easily.
You can get the dimensions of the table with dim (rows x columns):
dim(studies)
## [1] 285 13
So we can answer question 1 now with the studies tibble instead:
cat("Answer 1: There are", nrow(studies), "studies in cBioPortal")
## Answer 1: There are 285 studies in cBioPortal
Let’s see what all the columns are in this table:
colnames(studies)
## [1] "name" "shortName" "description" "publicStudy"
## [5] "pmid" "citation" "groups" "status"
## [9] "importDate" "allSampleCount" "studyId" "cancerTypeId"
## [13] "referenceGenome"
There is a column called cancerTypeId, which is exactly what we need for question 2:
cat("Answer 2: The studies span", length(unique(studies$cancerTypeId)), "cancer types")
## Answer 2: The studies span 88 cancer types
There is also a column called allSampleCount with the number of samples for each study. That will help us answer question 3:
cat("Answer 3: There are", sum(studies$allSampleCount), "samples in cBioPortal")
## Answer 3: There are 86830 samples in cBioPortal
And question 4:
cat("Answer 4: The study with the most samples is", studies[which.max(studies$allSampleCount), "name"][[1]])
## Answer 4: The study with the most samples is MSK-IMPACT Clinical Sequencing Cohort (MSKCC, Nat Med 2017)
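One detail worth knowing about which.max is that it returns only the first position of the maximum. The sketch below, on a made-up table with a tie, shows the difference between that and selecting every row that reaches the maximum. The study names and counts are invented for illustration.

```r
# Made-up table in which two studies share the largest sample count.
toyStudies <- data.frame(
  name = c("Study A", "Study B", "Study C"),
  allSampleCount = c(500L, 900L, 900L),
  stringsAsFactors = FALSE
)

# which.max() gives the index of the first maximum only.
firstLargest <- toyStudies$name[which.max(toyStudies$allSampleCount)]

# Comparing against max() keeps every study tied for the largest count.
allLargest <- toyStudies$name[
  toyStudies$allSampleCount == max(toyStudies$allSampleCount)
]

print(firstLargest)
print(allLargest)
```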
So how could you have known at the beginning that the function getStudies existed, and so avoided the manual parsing of API responses? There is a function in R to list all functions in a package:
ls("package:cBioPortalData")
## [1] "allSamples" "cBioCache" "cBioDataPack"
## [4] "cBioPortal" "cBioPortalData" "clinicalData"
## [7] "genePanelMolecular" "genePanels" "geneTable"
## [10] "getDataByGenePanel" "getGenePanel" "getGenePanelMolecular"
## [13] "getSampleInfo" "getStudies" "molecularData"
## [16] "molecularProfiles" "removeCache" "sampleLists"
## [19] "samplesInSampleLists" "searchOps" "setCache"
## [22] "studiesTable"
A more user-friendly page listing all functions in the cBioPortalData package can be found at: https://waldronlab.io/cBioPortalData/reference/index.html. Another option is to look at the vignette of the package. Most packages in R include guides on how to use them called “vignettes”. You can open them like this:
# Note that opening the HTML files does not work on mybinder.org:
browseVignettes(package = "cBioPortalData")
## starting httpd help server ... done
In general it is good to first check if there is a function in cBioPortalData that pulls the data you need. If it’s not there one can resort to parsing the API responses directly as shown before.
For a simple example of visualizing some data from the API we will try to recreate the bar chart from the homepage in R:
To make things easier we will use the total number of samples instead of cases. The latter refers to the number of patients.
We still have the studies tibble object from before, which we can reuse. A tibble is an object which is part of the R tidyverse, an opinionated collection of R packages designed specifically for data science. It therefore makes sense to manipulate it with the dplyr library, which is part of the same collection. They have a nice cheatsheet available that shows common data transformations: https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf.
Let’s first get the top 20 counts per cancer type:
library(dplyr)                                       # provides %>%, group_by, summarise, ...

cancerTypeCounts <-                                  # assign results
  studies %>%                                        # %>% is the pipe operator
  group_by(cancerTypeId) %>%                         # group by cancer type
  summarise(totalSamples = sum(allSampleCount)) %>%  # sum allSampleCount per group into a new column
  arrange(desc(totalSamples)) %>%                    # sort by totalSamples, largest first
  top_n(20)                                          # keep the top 20 (selects by totalSamples)
This might be rather complex if you are not used to dplyr, but don’t worry about that too much for now. In short we manipulate the data by applying a sequence of functions. Let’s print the output:
cancerTypeCounts
## # A tibble: 20 x 2
## cancerTypeId totalSamples
## <chr> <int>
## 1 mixed 18196
## 2 brca 7251
## 3 prad 6819
## 4 coadread 3814
## 5 difg 3302
## 6 luad 2678
## 7 aml 2297
## 8 bll 2144
## 9 gbm 2082
## 10 breast 2059
## 11 blca 1934
## 12 nsclc 1843
## 13 ccrcc 1813
## 14 hgsoc 1680
## 15 ucec 1647
## 16 skcm 1539
## 17 thpa 1512
## 18 hnsc 1478
## 19 nbl 1472
## 20 hcc 1461
In the output we notice a cancer type called “mixed”, which is not in the plot on the homepage. The mixed cancer type indicates that the study contains samples with mixed cancer types. We’ll go ahead and exclude those studies for now:
cancerTypeCounts <-
  studies %>%
  filter(cancerTypeId != "mixed") %>%                # add filter for mixed type
  group_by(cancerTypeId) %>%
  summarise(totalSamples = sum(allSampleCount)) %>%
  arrange(desc(totalSamples)) %>%
  top_n(20)
## Selecting by totalSamples
cancerTypeCounts
## # A tibble: 20 x 2
## cancerTypeId totalSamples
## <chr> <int>
## 1 brca 7251
## 2 prad 6819
## 3 coadread 3814
## 4 difg 3302
## 5 luad 2678
## 6 aml 2297
## 7 bll 2144
## 8 gbm 2082
## 9 breast 2059
## 10 blca 1934
## 11 nsclc 1843
## 12 ccrcc 1813
## 13 hgsoc 1680
## 14 ucec 1647
## 15 skcm 1539
## 16 thpa 1512
## 17 hnsc 1478
## 18 nbl 1472
## 19 hcc 1461
## 20 stad 1365
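If the dplyr pipeline feels opaque, it can help to see the same sequence of steps written out one at a time. The sketch below does this in base R on a small made-up table; the counts are invented, and only the column names follow the studies tibble.

```r
# Made-up study table using the same column names as the studies tibble.
toyStudies <- data.frame(
  cancerTypeId = c("brca", "mixed", "brca", "luad", "prad", "luad"),
  allSampleCount = c(100L, 500L, 50L, 70L, 90L, 10L),
  stringsAsFactors = FALSE
)

# Step 1: drop the studies with mixed cancer types (filter).
noMixed <- toyStudies[toyStudies$cancerTypeId != "mixed", ]

# Step 2: sum the sample counts per cancer type (group_by + summarise).
totals <- aggregate(allSampleCount ~ cancerTypeId, data = noMixed, FUN = sum)
names(totals)[2] <- "totalSamples"

# Step 3: sort from largest to smallest (arrange + desc).
totals <- totals[order(totals$totalSamples, decreasing = TRUE), ]

# Step 4: keep the first two rows (a stand-in for top_n).
top2 <- head(totals, 2)
print(top2)
```

One difference to keep in mind: head() keeps exactly the requested number of rows, whereas top_n() also keeps rows that tie with the last selected value.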
Now let’s try to plot it:
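A minimal sketch of such a plot with base R's barplot is shown here. It assumes only the first five rows of the filtered table above, typed in by hand so the example runs without the API; with the real data you would pass the totalSamples column with cancerTypeId as names instead.

```r
# First five rows of the filtered table above, typed in by hand.
counts <- c(brca = 7251, prad = 6819, coadread = 3814, difg = 3302, luad = 2678)

# Horizontal bars are easier to read with many categories.
# barplot() draws from the bottom up, so rev() puts the largest bar on top.
barMidpoints <- barplot(
  rev(counts),
  horiz = TRUE,
  las = 1,                       # keep the axis labels horizontal
  xlab = "Number of samples",
  main = "Samples per cancer type (top 5)"
)
```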
As you might have noticed the counts differ from the homepage. There are fewer samples in our plot. This is because the homepage also includes the counts for the mixed cancer types. To get the proper label names instead of the short cancer type name, one would have to get the full names from /api/cancer-types/ or use the DETAILED projection on the /api/studies/ endpoint.
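Once the full names are available, attaching them to the counts is a table join on the cancer type ID. The sketch below uses a hand-written lookup table as a stand-in for the /api/cancer-types/ response; the column name "name" and the full names shown are assumptions for illustration, not values fetched from the API.

```r
# Three rows of the counts table above, typed in by hand.
counts <- data.frame(
  cancerTypeId = c("brca", "prad", "luad"),
  totalSamples = c(7251L, 6819L, 2678L),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table standing in for the cancer types response.
cancerTypes <- data.frame(
  cancerTypeId = c("luad", "brca", "prad"),
  name = c("Lung Adenocarcinoma", "Invasive Breast Carcinoma",
           "Prostate Adenocarcinoma"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of counts, add the full name where one exists.
labelled <- merge(counts, cancerTypes, by = "cancerTypeId", all.x = TRUE)

# merge() sorts by the join key, so restore the order by sample count.
labelled <- labelled[order(labelled$totalSamples, decreasing = TRUE), ]
print(labelled)
```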
The cBioPortal study view shows a variety of charts. A reproduction of some of these charts can be found here.
Aside from making visualizations yourself using the barplot and plot functions (known as base R graphics) or the popular ggplot2, there are many visualization packages out there to make variations of the plots that one can find on the cBioPortal website, for instance maftools, GenVisR and ComplexHeatmap. The tricky part is usually transforming the data to work for the particular tool of interest, but once you’ve done it you can reuse the code and apply it to any new study that shows up in cBioPortal. This is one of the powerful features of using an API where one can expect the data to always be in the same format.
Once you start pulling more data from the API you will notice it becomes complicated to manage. You need a data store in R that makes it easy to subset data from multiple assays together with the corresponding clinical data. This is the key layer that cBioPortalData adds on top of the API: the MultiAssayExperiment. This object makes it very easy to manage all the data. It also abstracts the complexities of using the REST API directly. For a comprehensive overview it is best to check out the website or look at the vignette:
# Note that opening the HTML files does not work on mybinder.org:
browseVignettes(package = "MultiAssayExperiment")
An example using cBioPortalData MultiAssayExperiment can be found here. It follows some of the examples found in the MultiAssayExperiment vignette and applies them to data from the TCGA study Lung Invasive Adenocarcinomas (LUAD):
More general R and Bioconductor