Dendro Research Notebook: Interactive Scientific ......you. Finally I would also like to thank my...
Transcript of Dendro Research Notebook: Interactive Scientific ......you. Finally I would also like to thank my...
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO
Dendro Research Notebook: InteractiveScientific Visualizations for e-Science
Bruno Monteiro Marques
Master in Informatics and Computing Engineering
Orientador: João Miguel Rocha da Silva
Co-orientador: Tiago Nuno Mesquita Folgado Leitão Devezas
May 1, 2020
c© Bruno Marques, 2020
Dendro Research Notebook: Interactive ScientificVisualizations for e-Science
Bruno Monteiro Marques
Master in Informatics and Computing Engineering
May 1, 2020
Resumo
O conceito de Open Science trouxe ao centro das atenções da comunidade científica uma dis-cussão sobre a reprodutibilidade no processo científico. A utilização de visualizações interativascomo uma forma de apresentar dados de investigação, recentemente popularizada pelas tecnolo-gias de notebooks computacionais, é vista como um passo inovador no sentido de obter uma maiorreprodutibilidade.
Um dos requisitos da reprodutibilidade sobre resultados de investigação é a habilidade deanalisar e reutilizar datasets científicos, tornando-se uma característica fundamental de um pro-cesso científico correto e transparente. Para atingir este fim os dados e os seus metadados devemser descritos corretamente de forma a serem replicáveis e poderem ser reutilizados em diferentescontextos. Plataformas como o Jupyter ou distill.pub são bons exemplos de novas maneiras de co-municar cientificamente, permitindo aos utilizadores a construção de notebooks computacionaisque contêm texto, visualizações interativas e código que pode ser visto e partilhado com outrosutilizadores.
O conceito do Dendro Research Notebook nasce da vontade de capturar todos os componentestípicos do processo de investigação sobre a forma de um único documento interativo, enquantosimultaneamente apresentando visualizações descritivas de forma a facilitar o processo de inter-pretação dos dados.
O Dendro Research Notebook procura melhorar as capacidades de processamento, análisee reutilização dos dados tornando-se assim uma plataforma capaz de complementar os métodostradicionais de depósito de metadados com a associação de visualizações e métodos de processa-mento sobre a forma de notebook.
O trabalho desenvolvido nesta dissertação teve como fundamento uma análise de conceitos doestado da arte como “Ciência Aberta” e “Dados Abertos”, repositórios de dados, tecnologias denotebook e plataformas orientadas a publicação científica. Através de uma colaboração próximacom investigadores e utilizadores da plataforma este trabalho procurou estender as capacidades doDendro de forma a integrar um serviço de notebooks computacionais web apto para criar, editar,partilhar e visualizar Jupyter notebooks.
Estas novas capacidades da plataforma Dendro foram demonstradas aos investigadores quecolaboraram de forma próxima com o desenvolvimento desta dissertação, tendo através de umconjunto de sessões experimentais avaliando o sucesso da implementação. Obtendo resultadospredominantemente positivos nos diversos aspetos da implementação analisados, compreendendovisualização, gestão de dados e reprodutibilidade.
Os resultados obtidos, em particular nos aspetos relacionados com a reprodutibilidade, demon-stram a capacidade desta implementação em suportar um novo paradigma de investigação baseadona ciência aberta.
Keywords: Reprodutibilidade, Ciência Aberta, Dados Abertos, Partilha de Dados, Repositóriosde Dados, Web Notebooks
i
ii
Abstract
Open Science has contributed to bring reproducibility, the ability to reproduce research, to thecenter of discussion of scientific communities. By making available research datasets for othersto reanalyse and reuse, scientists can thus contribute to increase the transparency of the scientificprocesses and workflows. To this end, metadata and data should be well-described so that they canbe replicated and/or combined in different settings.
Innovative means to interpret existing scientific results and contribute to reproducibility havebeen developed in recent years. One such way is the combined usage of interactive visualiza-tions and web notebook technologies. Platforms like Jupyter or distill.pub are examples of a newapproach to communicate in science, allowing users to build computational notebooks that cancontain text, interactive visualizations and code that can be shared and viewed by other users.
The concept of research notebooks stems from the idea that the ability to capture all typicalcomponents of a research study in a common interactive document form, while presenting inter-esting visualizations that assist in data interpretation, is an effective way to leverage the universalappeal of visual representations to improve reproducibility. Dendro Research Notebook aims tofoster reproducibility by improving the processing, analyzing and reusing aspects of scientific datalife cycle by allowing users to interact and share data processing methods within the Dendro plat-form via integrated web notebooks. The goal is therefore to make Dendro a platform capable ofcomplementing traditional metadata records with visualization and data processing through theintegration of notebooks.
The work developed in this dissertation was supported on the analysis of state of the art con-cepts like “Open Science” and “Open Data”, data management platforms, computational note-books technologies and publication-oriented platforms. By working closely with researchers andusers of the platform, we extended the capabilities of Dendro with a computational web note-book service ready to handle creating, editing, persisting, sharing and visualization of Jupyternotebooks. These new features of the platform were assessed by several researchers that collabo-rated with this dissertation, through a series of experimental sessions evaluating the success of theimplemented work.
The results obtained were predominantly positive in the evaluated dimensions of the imple-mented work, namely visualization, data management and reproducibility. The favorable results,particularly the ones related with reproducibility, indicate the ability of the developed work insupporting a new paradigm of scientific research based on Open Science. Its long term impact ishowever dependent on the acceptance, by the users of the platform, of the newly integrated toolsto create, share and annotate their research with notebook-powered visualizations.
Keywords: Reproducibility, Open Science, Open Data, Data Sharing, Data Repositories, WebNotebooks
iii
iv
Acknowledgements
Before proceeding I want to acknowledge some people that made this work possible eitherthrough continuous advice, support and motivation. Firstly I have to thank both my supervisorsJoão Miguel Rocha da Silva and Tiago Nuno Mesquita Folgado Leitão Devezas for their tirelessdedication and patience towards me and this dissertation. The example set by their work ethic willhave a long lasting effect in my perception of respect, thoroughness and persistence in their workand in the scientific method. I want thank my supervisors for always pushing and motivating meto excel, the best aspects of this dissertation are certainly a consequence of the effort put forwardby them.
I also want to thank my family for always supporting my decisions devoting so much time andeffort to provide me with the opportunity to study and dedicate myself to my passions. I certainlywouldn’t be able to complete this work without their support. To my mother and my father throughtheir example of perseverance and determination that always motivated me to reach further, thankyou.
Finally I would also like to thank my friends, the family I built during these years I spent inuniversity, I could not have been blessed to have better companions than you. I can’t imagine everachieving what I have today if not for the belief we have in each other. I truly feel like under manycircumstances I carried onward thanks to your immense support alone and I hope only to be ableto repay that one day. The unimaginable amount of encouragement we provided both through thebest and worst moments of these years will one day seem trivial, but I know the memory of whatyou have done for me, I will carry for a lifetime, thank you.
Bruno Marques
v
vi
“An expert is a person who has found out by his own painful experienceall the mistakes that one can make in a very narrow field.”
Niels Bohr
vii
viii
Contents
1 Introduction 11.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Motivation and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.3 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Literature Review 52.1 Open Science and Open Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 FOSTER Open Science Training Tools . . . . . . . . . . . . . . . . . . 72.1.2 Barriers to Open Science . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1.3 The FAIR Guiding Principles for Scientific Data Management . . . . . . 92.1.4 Scientific Data Management Platforms . . . . . . . . . . . . . . . . . . 102.1.5 Data Management Platform Requirements . . . . . . . . . . . . . . . . . 102.1.6 Dendro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Web Notebooks for Open Science . . . . . . . . . . . . . . . . . . . . . . . . . 122.2.1 State of the Art in Web Notebooks . . . . . . . . . . . . . . . . . . . . . 122.2.2 Popularity of Notebook Technologies . . . . . . . . . . . . . . . . . . . 132.2.3 Jupyter Notebooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.2.4 Distill.Pub . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.2.5 Advantages for Researchers . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Data Visualization in Open Science . . . . . . . . . . . . . . . . . . . . . . . . 162.3.1 What is Data Visualization? . . . . . . . . . . . . . . . . . . . . . . . . 162.3.2 The Value of Data Visualization . . . . . . . . . . . . . . . . . . . . . . 172.3.3 Data Visualization in Open Science . . . . . . . . . . . . . . . . . . . . 18
3 Methodological Approach 253.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253.2 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.3 Selecting Jupyter Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.4 Jupyter Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.5 Notebook Viewer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.6 Iterative Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.6.1 Expert User Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.6.2 Interview Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.7 Implementation Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.8 Expert User Interview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.8.1 Interview with Expert User . . . . . . . . . . . . . . . . . . . . . . . . . 303.8.2 Extracted Deductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.9 Jupyter Notebook Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
ix
x CONTENTS
3.9.1 Jupyter Notebook Docker Structure . . . . . . . . . . . . . . . . . . . . 323.9.2 Reverse Proxy Implementation . . . . . . . . . . . . . . . . . . . . . . . 343.9.3 Synchronization and Data Management . . . . . . . . . . . . . . . . . . 36
3.10 Jupyter Notebook Viewer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393.11 Difficulties and Obstacles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.12 Digressions from the initial implementation . . . . . . . . . . . . . . . . . . . . 41
4 Outcomes and Experiments 434.1 Implementation Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.1 Jupyter Notebook Integration . . . . . . . . . . . . . . . . . . . . . . . 434.1.2 Notebook Viewer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Experimental Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494.2.1 Usability Test Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494.2.2 Experimental Sessions . . . . . . . . . . . . . . . . . . . . . . . . . . . 524.2.3 Result Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534.2.4 System Usability Scale . . . . . . . . . . . . . . . . . . . . . . . . . . . 594.2.5 User Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Conclusions and Future work 655.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
References 69
A Implementation Overview Questionnaire 73
B System Usability Scale Form 79
C Introductory Guide to System Testing Procedure 83
List of Figures
2.1 Promoting openness at different stages of the research process [35]. . . . . . . . 62.2 Relative analysis of total number of Open Access publications per year [10]. . . . 72.3 Relative analysis of total number of Open Access publications per country [10]. . 82.4 Open Science taxonomy [35]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.5 Data Life Cycle [40]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.6 CKAN visualization example provided by Recline Data Viewer [8]. . . . . . . . 112.7 Notebook technologies comparison based on Github metrics [28]. . . . . . . . . 142.8 Example of paper at Distill.pub [18]. . . . . . . . . . . . . . . . . . . . . . . . . 152.9 Value of visualization scheme proposed by [50]. . . . . . . . . . . . . . . . . . . 172.10 A popularity comparison between some of the most used libraries in Notebooks [28]. 192.11 An interactive visualization of connections among major U.S. airports in 2008 [51]. 202.12 This example depicts character co-occurrences in Victor Hugo’s Les Misérables [51]. 202.13 Estimation of PI by randomly sampling points and counting how many of them
fall inside or outside a unit circle [51]. . . . . . . . . . . . . . . . . . . . . . . . 212.14 The Flare visualization toolkit package hierarchy and imports [24]. . . . . . . . . 222.15 U.S. counties vote shift [11]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222.16 An example showing the streamlined nature of matplotlib’s visualizations [25]. . 22
3.1 Implementation Scheme Representing Docker Structure Within the Integration . 333.2 Container and Volume naming structure . . . . . . . . . . . . . . . . . . . . . . 353.3 Reverse Proxy Diagram Structure . . . . . . . . . . . . . . . . . . . . . . . . . 363.4 Flow Chart of Notebook Creation and Restoration Systems . . . . . . . . . . . . 383.5 Notebook Monitor Job schema . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 Notebook Creation Flow Chart. . . . . . . . . . . . . . . . . . . . . . . . . . . . 454.2 Notebook Start/Restore Flow Chart. . . . . . . . . . . . . . . . . . . . . . . . . 464.3 Active Notebook Interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474.4 Interaction With Active Notebook Execution. . . . . . . . . . . . . . . . . . . . 484.5 Using Dendro to Explore the Notebook File System. . . . . . . . . . . . . . . . 494.6 Notebook Viewer Static View. . . . . . . . . . . . . . . . . . . . . . . . . . . . 504.7 Notebook Viewer Rendering Imported Notebook View. . . . . . . . . . . . . . . 504.8 Characterization of subjects through level of experience with web notebooks, data
management and processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544.9 Graphical representation of time to task in “Creation of a Notebook”. . . . . . . . 554.10 Graphical representation of time to task in “Upload/Creation of Notebook Files”. 564.11 Graphical representation of time to task in “Executing the Notebook”. . . . . . . 564.12 Graphical representation of time to task in “Sharing and Starting existing Notebook”. 564.13 Graphical representation of time to task in “Previewing an External Notebook”. . 57
xi
xii LIST OF FIGURES
4.14 Cumulative values for navigation errors. . . . . . . . . . . . . . . . . . . . . . . 584.15 Cumulative values for presentation errors. . . . . . . . . . . . . . . . . . . . . . 584.16 Different scoring analysis methods associated with raw SUS scores. . . . . . . . 604.17 SUS scores represented in red against percentile and grade graph . . . . . . . . . 614.18 Results from the Implementation Appreciation Form . . . . . . . . . . . . . . . 62
List of Tables
2.1 A comparison of several popular Research Notebook frameworks . . . . . . . . . 13
3.1 Use case table for the creation of a Notebook . . . . . . . . . . . . . . . . . . . 263.2 Use case table for launching a Notebook . . . . . . . . . . . . . . . . . . . . . . 263.3 Use case table for the execution of a Notebook . . . . . . . . . . . . . . . . . . . 263.4 Use case table for the download of a Notebook . . . . . . . . . . . . . . . . . . 263.5 Use case table for the editing of metadata in a Notebook . . . . . . . . . . . . . 273.6 Use case table for the previewing of a Notebook . . . . . . . . . . . . . . . . . . 273.7 Use case table for the search of a Notebook . . . . . . . . . . . . . . . . . . . . 27
4.1 Time taken to complete each of the experimental tasks . . . . . . . . . . . . . . 554.2 Results of the System Usability Survey . . . . . . . . . . . . . . . . . . . . . . . 604.3 Results for satisfaction form in table form . . . . . . . . . . . . . . . . . . . . . 63
xiii
xiv LIST OF TABLES
Abbreviations
RDM Research Data ManagementOA Open AccessUS United StatesCSV Comma Separated ValuesINESC TEC Instituto de Engenharia de Sistemas e Computadores, Tecnologia e CiênciaAPI Aplication Programming Interfaceipynb IPython Notebook file extensionURL Uniform Resource LocatorFEUP Faculty of Engineering of the University of Porto
xv
Chapter 1
Introduction
Computational notebooks are virtual programming environments that often combine the abil-
ities of a word processing software with a kernel and shell of a respective programming language.
These computational notebooks, sometimes designated as notebook interfaces, find usage amongst
publishers, data analysts, universities and others, including personal use [41].
This concept was firstly introduced as the Wolfram Mathematica in 1988. It featured a note-
book like graphical interface combined with a processing kernel capable of executing the Wolfram
language [29]. As time went on a number of different kernels for different languages were intro-
duced ranging from MATLAB 1, Pyhton 2, Julia 3, among others [54]. In recent years, notebooks
have found utility by providing interactive ways to store data in a easily presentable condition,
bridging the gap between storing data, annotation and presentation. The ability to execute code
within notebook environments creates an opportunity for users looking for an uncomplicated way
to build their code, making computational notebooks an attractive tool for researchers. These com-
putational notebooks’ ability to retain data alongside processing code and visual representations
for future analysis and distribution create a environment for reproducibility [28].
Data visualization on the other hand has for long played an important role in exploration, anal-
ysis and presentation of scientific data [52]. Producing visualizations to present research findings
has always been an integral step of the scientific process through advances in graphics hardware,
combined with the development of new visualization methods, techniques and systems which
have provided new ways to interpret scientific research [50, 38]. Widgets allowing scientists to
execute and aggregate visualizations based on the code processing data all captured within the
notebook environment push towards a paradigm on scientific data being presented via interactive
visualizations. These visualizations can offer several benefits to conventional presentation such
as the refining of parameters/attributes, object manipulation tasks (selection, translation, rotation,
scaling, aggregation and time), filtering of data sets, among others [38, 28].
Finally the term e-Science dates back to 1999 referring to computationally intensive science
that is carried out in highly distributed network environments using large data sets [4]. Since the
1mathworks.com2ipython.org/3github.com/JuliaLang/IJulia.jl
1
2 Introduction
term’s inception that global scientific collaboration has evolved to take on many forms, however
from the various initiatives around the world, a consensus is emerging: research collaboration
should aim to be “open” or at least there should be a substantial measure of “open access” to the
data and information underlying published research [13].
In the interest of supporting this paradigm shift towards a more Open Science it is important
to provide researchers and other interested counterparts tools that are capable of accommodating
this change. A scenario where most of the steps of the scientific research are encapsulated in a
convenient platform supporting collaboration and sharing is ideal in order to support this effort.
1.1 Context
This increased prevalence of Open Science has brought reproducibility to the center of discus-
sion of the scientific community. The ability to reproduce scientific research is a requirement for
ensuring the transparency and correctness of the research workflow. This focus on reproducibil-
ity can be seen through various publications, platforms, technologies and other elements of the
scientific process [9, 45].
The Dendro platform is a scientific data management platform that has been under develop-
ment at FEUP. It aims to facilitate the management of data sets, documents and other digital mate-
rials produced by research groups. Dendro can extract and index contents from files in a seamless
manner, allowing users to upload files in different formats like CSV or Excel, and automatically
store them in a Big Data database suited to be queried [12]. This platform seeks to provide its users
with effective tools that allow them to pursue these new paradigms in scientific research looking
to associate its methods of storing data with solutions to increase reproducibility. The usage of
interactive visualizations heightened by the expansion of the notebook interface technology can
be seen in platforms like distill.pub and the popular Jupyter Notebook. They are good examples
of emerging technologies associated with reproducibility and by effect Open Science [14, 38].
This was the origin to the idea behind this dissertation: combining the strong points of note-
book interfaces, in particular their flexibility in providing interactive visualizations, with Dendro’s
data storing environment, empowering its users with a tool that allows for depositing data, execut-
ing code over that data and create visualizations, therefore increasing the appeal for reproducibil-
ity.
To understand the impact that the inclusion of computational notebooks can have in a platform
like Dendro we must discuss the most prominent and widely used notebook interface at this time.
The Jupyter technology is an open-source software for interactive computing that supports dozens
of programming languages, which is one of the factors that makes it the most attractive for the
scientific community. The support for this platform is extensive, including its dedicated and large
user base and open-source software features. Its multi-language support also makes it a flexible
option for different groups of users [26].
1.2 Motivation and Goals 3
Jupyter technology allows scientists to create their own dynamic notebooks, with prose, in-
teractive visualizations even code snippets that can be shared and viewed by other subjects. Dis-
till.pub is an open access scientific journal that uses these tools to expand the way scientific articles
are written [14, 38].
1.2 Motivation and Goals
This dissertation looks to bring these aspects together in Dendro, trying to elevate the platform
to a standard that will fulfill the necessities of the most updated guidelines for reproducibility and
Open Science [53, 41, 35].
With this proposal we will be able to observe how the ability to pair data and processing
methodologies with annotations and or visualizations will increase the value of said data appealing
to reproducibility.
This endeavour looks to bring Notebook Interfaces to the Dendro platform setting out to further
develop it into a powerful research tool.
With “Dendro Research Notebook“ the deposited data in Dendro will integrate the strengths of
both the notebook and the Dendro platform, combining the visualization and dynamic interactions
of notebook platforms with the already established data management tools at Dendro. This leads
us to the hypotheses:
The integration of sharing, processing and visualization aspects of computational
notebooks in data management platforms will improve reproducibility and encour-
age Open Science.
In order to fulfil this hypothesis, this work sets out with goals mainly focused on inciting
reproducibility. Dendro as a research notebook should allow for the creation of notebooks from the
platform, sharing file and data structure directly to the notebook. The way data is described within
Dendro will be expanded upon by the notebook, allowing researchers to describe their works
with visualizations, executable code and text. Researchers will also be able to easily find other
notebooks within the public repositories in the platform in order to facilitate reuse of scientific
data. Interactive visualizations alongside snippets of the notebook will be integrated with the data
and existing project descriptions. Researchers will also be able to download and edit metadata
belonging to the notebook.
To perform this implementation, consultation with researchers will be crucial in order to pin-
point their necessities. Finally, to assess the value of the implementation, various evaluation tasks
will be performed in order to assess some aspects of usability, like effectiveness, efficiency and
satisfaction. This work will demonstrate how the visualization solutions and interactivity with
scientific data brings incentive to reusability, contributing for a better data management workflow
in the future.
4 Introduction
We hope that this dissertation will contribute to the research data management field by suc-
cessfully modelling a new workflow where data is closely connected to its presentation and pro-
cessing, concentrating several steps of the scientific research in the same tool and keeping them
closely linked will emphasize the importance of bridging the gap between data and visualization.
This will develop a bigger concern with the reproducibility of data and generally improve these
aspects within Dendro.
1.3 Document Structure
This document in addition to the introductory chapter, has a literature review, a methodological
approach chapter, an experiments and outcomes chapter, and a conclusions section.
The literature review section is divided into three parts. In order to understand the motivation
behind this work we must analyse how Open Science has come to the attention of the scientific
community and what are its current requirements. We should also have a brief overview of current
notebook technologies in order to correctly assess which technology to implement on the platform,
as well as study visualization and its contribute to data reproducibility.
Starting with Open Science and Open Data, we will look to define them and their obstacles,
and also explore the data life cycle aspects, FAIR principals, web notebooks, and their interactions
with Open Science, reproducibility and data managements platforms. Finally we will overview
visualization under an Open Science context and its possible impacts.
In the methodology approach section, we will go over all of the implementation process from
problem formulation, requirements inquiry and implementation. Starting with requirements gath-
ering, we will discuss the problem formulation, use cases and technology selection. We will then
present a structural modeling of the implementation and describe the implementation process in
detail.
In the experiments and results sections, we will discuss both the experimental procedure that
validated this dissertation’s work and the satisfaction with the implemented solution. We start
with an analysis of the implementation outcomes, describing all of the implemented features and
measuring the degree of success of the implementation. We then present an experimental analysis
over these features in order to better evaluate the success this dissertation achieved from the end
users’ standpoint.
Finally, the last chapter overviews all of the performed work in this dissertation, its impact,
and possibilities for expansion in future work.
Chapter 2
Literature Review
This section of the dissertation addresses the current state of the art of the elements necessary
to understand this work. These range from different fields, so a careful approach will be used
in order to understand the different aspects of these fields that combine into creating the solution
this work came to. There are three main aspects that should be understood to grasp the concepts
fundamental to this dissertation: Open Science, web notebooks and visualization. It is essential
to understand the impact that open science has had on the scientific community especially in
order to understand the motivation behind this thesis. Web notebooks should be overviewed in
order to understand the possibilities these platforms provide and determine which technologies are
right for this dissertation’s work. Finally data visualization is an integral part of this dissertation
and one of the main aspects in which this work can push the Dendro platform to an effective
reproducible research tool and therefore an analysis of the visualization solutions that integrate
into these notebooks is necessary.
2.1 Open Science and Open Data
e-Science is a term dating back to 1999 first used by John Taylor, director of the UK’s science
and technology office at the time. It refers to computationally intensive science that is carried out
in highly distributed network environments, using immense data sets that require grid comput-
ing [4]. Global scientific collaboration takes many forms, but from the various initiatives around
the world, a consensus is emerging: research collaboration should aim to be “open” or at least
there should be a substantial measure of “open access” to the data and information underlying
published research [13].
Open Science represents a new approach to the scientific process based on cooperative work
and new ways of diffusing knowledge by using digital technologies and new collaborative tools [35].
For researchers, however, this translates into a need to adapt various steps of the scientific process.
The principles of openness should be extended to the whole research cycle, from data collection,
processing, storing, preservation, distribution, reuse and even from hypothesis composition, as
shown in Figure 2.1.
5
6 Literature Review
Figure 2.1: Promoting openness at different stages of the research process [35].
Open Science Monitor 1 analyses data and case studies covering access to scientific publica-
tions. It divides publications into “Gold Open Access” and “Green Open Access”. In Gold OA
(Open Acess), research outputs are made available under an open access license by the publisher
on the journal website. Under Green OA terms, research outputs are not made available by the
publisher, but by the author(s), who independently deposit data and publications in an open access
repository [9]. It should be noted however that “Gold Open Access” is commonly associated with
an Article Processing Charge whose value can reach five thousand dollars in some journals. This
stands as a clear obstacle to a broader existence of Gold Open Access as a standard for scientific
publishing [47].
In Figure 2.2 we can see that there has been a general increase in research work made open
access by the publisher or publication. We can also denote from the figure how Gold Open Access
is gaining popularity over time. This can be seen as a positive since it removes responsibility
from the author to make the research available — this is a step in the right direction since plenty
of projects aim to improve in ways to make their research available. If this were to become the
norm for a majority of projects we could see a big improvement in open science indicators across
research communities [31, 53, 35].
In addition to the increase in Open Access publication we can also see in Figure 2.3 which
countries lead the way in this field. This is not just a scientific trend since countries like the
Netherlands are trying to develop strategies to achieve 100% open access by 2020 [31].
1fosteropenscience.eu/content/open-science-monitor
2.1 Open Science and Open Data 7
Figure 2.2: Relative analysis of total number of Open Access publications per year [10].
2.1.1 FOSTER Open Science Training Tools
FOSTER is an EU project aimed at identifying, enriching and providing training content on
relevant Open Science topics in support of the European Commission‘s Open Science Agenda
in the European Research Area [35]. This project focuses on providing Open Science training
to the European research community with methods such as supporting young researchers in their
compliance with open access policies and looking to integrate OA principles in current research
workflow, while strengthening the institutional training capacity beyond the project.
In Figure 2.4 we can distinctly see how Open Science comprises several fields of work. Rang-
ing from reproducible research, evaluation, tools or even policies. Open Data in particular is
relevant to this work since we plan to integrate these concepts with scientific data management.
However Open Science does come with some challenges.
2.1.2 Barriers to Open Science
In a recent case study of the Netherlands’ Plan on Open Science, a particularly interesting
topic were the barriers encountered [31]. One of the most relevant was the storing and sharing
of data since discipline-specific data protocols within technical and policy-based preconditions
are needed to ensure a consistent and FAIR access to research data [53]. These privacy issues,
proprietary aspects, and ethics are common barriers in open science to all fields, and data manage-
ment platforms are trying to provide solutions to create solid infrastructure to support the reuse of
scholarly data.
8 Literature Review
Figure 2.3: Relative analysis of total number of Open Access publications per country [10].
2.1 Open Science and Open Data 9
Figure 2.4: Open Science taxonomy [35].
2.1.3 The FAIR Guiding Principles for Scientific Data Management
A diverse set of stakeholders from academia, industry, funding agencies, and scholarly pub-
lishers have come together to design and jointly endorse a concise and measurable set of principles
that we refer to as the FAIR Data Principles, with the intent for these to act as a guideline for those
wishing to enhance the reusability of their data holdings [31, 53]. These principles focus on four
mains aspects: findability, accessibility, interoperability and reusability. The findability of data
consists of making sure that data is easy to find by humans and computers, allowing for automatic
discovery of datasets and services crafted with FAIR in mind. Accessibility consists of ensuring
that the assigned data provides a clear path on how it can be accessed. Interoperability is also
fundamental considering the necessity to interoperate data between different workflows, tools,
processing methods and storage. And finally, the ultimate goal is the optimization of the reuse of
data, so steps like making sure data is well described in order to be reused are vital [19].
2.1.3.1 Data Life Cycle
The data life cycle shown in Figure 2.5 is the sequence of stages that a particular unit of data
goes through from its initial generation or capture to its eventual archival and/or deletion at the
end of its useful life [40].
This is one of the cornerstones of scientific data management: respecting this process and
finding ways to optimize each segment of the cycle has been on the core of the developed work in
this area. However, researchers believe there is a “reproducibility crisis”, since the complexity and
extent of scientific experiments and data collections has gone so far that reproducing experiments
and reusing scientific data from other research groups is becoming increasingly difficult [2].
10 Literature Review
Figure 2.5: Data Life Cycle [40].
Considering the data life cycle diagram in Figure 2.5 we can begin to understand in which
steps of the cycle this dissertation work can affect reproducibility in a positive way. The integra-
tion of web notebooks into the workflow can affect almost every step of this cycle one way or
another. However when considering the core function of the notebook, storing, processing and
analysis methods, making them easily interchangeable alongside data itself, cataloguing data and
experiments within notebooks allowing for an interactive way to preserve data, we can see just
how impactful the web notebook can be. Since increasing reproducibility is one of the core as-
pects of this work, the employment of web notebooks’ abilities over the data life cycle opens new
exciting possibilities in this field [38, 54].
2.1.4 Scientific Data Management Platforms
With the growing number of scholarly papers being published and a growing awareness of
the importance, diversity and complexity of data generated in research context, platforms look
to offer solutions designed for managing research data [30]. These solutions are being actively
developed by both open-source communities and data management-related companies focusing
on description and long-term preservation of data [30, 1]. These require detailed, domain-specific
descriptions to be correctly interpreted [39].
2.1.5 Data Management Platform Requirements
Data management platforms provide a variety of functions while maintaining description and
long-term preservation as the core focus of their functionality. Different platforms bring distinct
solutions packages and this variety is what sometimes creates difficulty selecting an appropriate
management tool [1]. In order to find impactful ways this work can affect RDM (Research Data
2.1 Open Science and Open Data 11
Figure 2.6: CKAN visualization example provided by Recline Data Viewer [8].
Management) we should study reference RDM platforms and the features they display in the con-
text of open access. And analyse to which extent the integration of a notebook will cover the
existing features these platforms have to offer. The most important solutions in the same context
as Dendro2 are instances running at both research and government institutions [1]. Platforms like
DSpace3, CKAN4, Zenodo5, Figshare6, ePrints7 and EUDAT8 fill this category and all comprise
some sort of previewing of data. We will look at visualization in particular since promoting inter-
activity and allowing researchers to annotate their data with visualizations is one of the focuses of
this work.
2.1.5.1 Visualization in Scientific Data Management Platforms
Even after a brief inspection it is quite easy to see CKAN as one of the most prominent portal
software framework used for publishing Open Data, used by several governmental portals [32].
Intending to maximise re-use of data, quite in line with the work we propose, CKAN’s datastore
can store structured data and provide access to it via an API. This will simplify the process for
researchers checking and re-using data from earlier research. This is one of the aspects where
CKAN shines the most by creating interactive data visualizations, using the built-in Recline data
viewer.
Visualizations also include map plots of geo-coded data or image files displayed on their re-
source pages. Even though these visualizations are a step in the right direction and retain value in
their convenience to researchers, they still don’t provide the ability to manipulate data or generate
2https://github.com/feup-infolab/dendro3https://duraspace.org/dspace/4https://ckan.org/5https://zenodo.org/6https://figshare.com/7https://www.eprints.org/uk/8https://www.eudat.eu/
12 Literature Review
custom visualizations the same way a web notebook would. The best of both worlds would be to
have some simple automated visualization aspects just the way CKAN provides, while having the
opportunity for users who would want to go more in depth to develop their own visualizations and
manipulate and test over the research data in question in an accessible way.
2.1.6 Dendro
Dendro is a platform with “Dropbox like” capabilities and extended description features. It
works as a data storage and description platform that was designed to help researchers and users
describing their data files. Dendro is also collaborative and is designed to supports users collecting
and describing data, with its roots based on the field of research data management [12].
Dendro is designed to support this collaborative work including features such as metadata
versioning, permission management, editing and rollback and public, private or metadata only
visibility. This makes Dendro a flexible framework for data description. The inclusion of a web
notebook service within this platform can bring advantages, however to understand this we must
look at how web notebook can contribute to research data management.
2.2 Web Notebooks for Open Science
Notebook interfaces were first introduced in a very rudimentary way around 1988 in the Wol-
fram Mathematica 1.0 software for the Macintosh [29]. A notebook interface is a virtual envi-
ronment used for programming which pairs the functionality of a word processing software with a
shell and kernel programming language. As scientific work becomes more computational and data
intensive, research processes and results become more difficult to interpret and reproduce [54].
Web notebooks are a type of web based notebook interface that combines the features of the
notebook with the accessibility of web platforms, creating exciting new opportunities for open
science. The value of notebook interfaces comes from their ability to capture all typical com-
ponents of a research study in a common, interactive and document-like form. Notebooks can
also be displayed in a slide show mode for interaction with decision makers [54]. Therefore, web
notebooks can be seen as an effective way to manage and exchange knowledge among scientific
communities.
2.2.1 State of the Art in Web Notebooks
In order to understand current versatility and capabilities of notebooks we analyze the most
successful notebook technologies on the market. Since its genesis in 1988, notebooks have taken
different forms, evolving alongside recent trends in computer science. From the first iterations of
Wolfram Mathematica9 to cloud based systems like Microsoft’s Azure Notebooks10 or Google Co-
laboratory11, these tools provide different services and have different strong points. Jupyter Note-9https://www.wolfram.com/mathematica/quick-revision-history.html
10https://notebooks.azure.com/11https://colab.research.google.com
2.2 Web Notebooks for Open Science 13
book had its origin in the IPython notebook12 and after enjoying success is now evolving towards
a new revised and more versatile platform JupyterLab13. It also integrates several popular forth-
coming platforms like Google Colaboratory project, so the future looks bright for Jupyter [28].
Table 2.1: A comparison of several popular Research Notebook frameworks
Developer Language Support Environment Multi-User Visualization License Cost
Jupyter Notebook Jupyter Multi-Language Local No ipywidgets BSD FreeJupyterLab Jupyter Multi-Language Local No ipywidgets BSD FreeJupyterHub Jupyter Multi-Language Cloud Yes ipywidgets BSD FreeJupyo Bits and Bots LLC. Python, R, and Julia Cloud Yes ipywidgets Proprietary SubscriptionGoogle Collaboratory Google Python Cloud Yes matplotlib CC FreeObservable Observable Inc. JavaScript Web No Vega and D3 MIT FreeAzure Notebooks Microsoft Python, R, F#, etc. Cloud No ipywidgets MVL VariableR Notebooks R Studio Inc. R, Python, Bash, etc. Local No r2d3 GPL Free
In the comparison shown in Table 2.1 we take a look at some of the current popular notebook
technologies. Many of the options in the market today are Jupyter based like Jupyo or Azure
Notebooks, and present alternatives for multi-user cloud running that allows users to share and
work in their notebooks simultaneously without worrying about server setup or maintenance. The
Google Colab initiative intends to bring these features to its users for free, however it is more
limited in Language Support and it is highly focused on Machine Learning projects. JupyterHub
is a multi-user version of the notebook designed for companies, classrooms and research labs
that aims to be customizable, flexible and highly scalable; it is also the closest available multi-
user experience of the original Jupyter Notebook. Regarding the visualization capabilities of each
platform, on the visualization column we point out some of the more interesting alternatives each
platform provides for data visualization.
Some of the more feature-rich solutions are ipywidgets14, which allows the creation user in-
terface controls for exploring code and data interactively, and therefore allow for a extensively
community support basis for Jupyter Notebooks in this field. R2d315 for R Notebooks also pro-
vides powerful publication focused visualization solutions for the users of the platform. Finally,
Vega, a visualization grammar commonly used in Observable notebooks allows the creation, sav-
ing and sharing of visualization designs in a JSON format, to generate web-based views using
Canvas or SVG combining great versatility with aesthetically pleasant visualizations.
2.2.2 Popularity of Notebook Technologies
An interesting exercise could be to evaluate the popularity of Notebook solutions in order to
try to understand which technologies are most widely used and get bigger community support.
Besides the previous comparison of technologies based on their technical support, we can also try
to obtain some metrics on the applicability of a notebook technology based on their popularity
rating. So, through a comparison based on GitHub metrics seen in Figure 2.7, we can extract
12https://ipython.org/13https://jupyterlab.readthedocs.io/en/stable/14https://ipywidgets.readthedocs.io/en/stable/15https://blog.rstudio.com/2018/10/05/r2d3-r-interface-to-d3-visualizations/
14 Literature Review
Figure 2.7: Notebook technologies comparison based on Github metrics [28].
some deductions relative to the current state of notebooks adoption. This comparison was made
through an analysis of the different official repositories for the notebooks, where the number of
contributors, watchers, forks and stars were utilized in order to establish which of the technologies
had a more active support.
Beyond the presence of well established technologies like SageMath or R Markdown we can
definitively see a predominance of Jupyter based technologies. Besides Jupyter itself, both Jupyter-
Lab and JupyterHub appear as some of the most popular repositories, making Jupyter a definite
leader when comparing developing notebook technologies. Developers wishing to implement
notebook solutions in their work, or simply researchers looking for a tool to accommodate their
workflow should always consider a large array of options and strive for interchangeable methods
in order to promote reproducibility. However if a choice must be made it is quite clear that Jupyter
Notebook is a leading options in this field [28].
2.2.3 Jupyter Notebooks
The Jupyter Notebook is an open-source web application whose uses include: data cleaning
and transformation, numerical simulation, statistical modeling, data visualization, machine learn-
ing, and much more [26]. These notebooks can be shared as executable files, using the Jupyter
Notebook Viewer, or via third party execution/management environments such as Anaconda En-
terprise Notebooks [38]. In addition, JupyterHub is able to create a multi-user Hub which spawns,
manages, and proxies multiple instances of the single-user Jupyter notebook server16 [35]. Due to
its flexibility and customization options, JupyterHub can be used to serve notebooks to a class of
students, a corporate knowledge base, or a scientific research group. Notebooks can produce rich
media output that combines narrative text, static images, code, and dynamic, interactive output
produced by Jupyter interactive widgets. These widgets allow the presenter to execute and visu-
alize results in real time based on the code contained in the notebook. In addition to the standard
notebook layout, Jupyter notebook cells can also be displayed in a PowerPoint presentation style
mode [35].
16https://jupyter.org/hub
2.2 Web Notebooks for Open Science 15
Figure 2.8: Example of paper at Distill.pub [18].
2.2.4 Distill.Pub
Distill.Pub [14] is an interesting platform to analyze, as it started off as an open-access sci-
entific journal aimed at improving communication of machine learning results [34]. This format
shares several similarities with this work’s goals and therefore by considering some of the aspects
of Distill we can try to understand what generates its popularity and which aspects compose the
appeal of this platform [18].
Firstly, the editors behind Distill inferred that the platform had over a million unique readers,
and more than 2.9 million views. Distill papers have been cited 23 times on average, placing
Distill in the top 2 % of academic journals indexed by Journal Citation Reports17. However, it’s
also important to remember that Distill is a very small publication selecting very few papers, as
stated by the editors themselves [18]. The three major areas where Distill’s concept have a major
impact are: interface, engagement and software engineering practices. To quote the authors on
their interface, “Bolstered by the interactivity, it invites readers to step into a way of thinking.”.
This tries to reflect on how upon interacting with the publication itself readers perceive concepts
in a different manner that can stimulate new ways of looking at the subject matter by rethinking
it. Engagement wise there’s an improvement derived from reading a paper, to testing and building
on it. With Distill however we see papers where engagement is a continuous spectrum, composed
by reading, interacting and even being able to reproduce the notebook, as seen in Figure 2.8.
Finally, on the software engineering end, every Distill article is housed within a GitHub repos-
itory, and peer review is conducted through the issue tracker giving readers greater transparency
into the publication process.
17https://distill.pub/2018/editorial-update/
16 Literature Review
2.2.5 Advantages for Researchers
The process of describing the experimental methodology with enough detail as to enable data
sharing and foster result analysis can be tedious. However if an experiment is not well described,
potential re-users and even the creators of data could be unable reproduce prior results. It is
often so much so that it becomes more cost-effective for researchers to try to reproduce data than
to reproduce previous research products [2]. Research Notebooks can help to cope with these
issues, as they serve as a means to directly share the process in a platform that can reproduce it
immediately, while still storing metadata and presenting visualizations over the actual data. This
allows other researchers to perform direct and on-demand analysis on the data, running the code
as the original creator built it, therefore improving the ability of researchers to reproduce the
experiments of others and execute their research work in a much simpler and secure way.
2.2.5.1 Sharing Data Increases Citations
Sharing data requires describing data to make it usable by other researchers, and this is a time
consuming process, so if there is no direct mandate, e.g. from a funding agency, other strong in-
centives need to be in place to convince researchers to invest the time needed. This incentive could
come from a citation advantage. It has been found, through bibliometric analyses, that a citation
advantage for astrophysical papers in core journals exists when the works are associated with data
by bibliographical links [17]. These papers receive on average significantly more citations per
paper per year than papers not associated with links to data.
2.3 Data Visualization in Open Science
When talking about visualization it is important to recognize that this is a mature field with
extensive work and a multitude of applications. A broadly explored concept whose aspects are
commonly considered when looking to build upon developing ideas. This ever evolving field has
mutated and adapted to the advancements around it time and time. In this new context for open
science however the values of visualization are once again reiterated and expanded upon in light of
reproducibility. To understand this we must go over some fundamental values of visualization [50].
2.3.1 What is Data Visualization?
Data visualization has always played an important role in exploring, analyzing, and presenting
scientific data [52]. We can define it as a set of techniques used in order to create imagery that
aims to facilitate the communication of a concept or message, which can take the form of graphs,
diagrams, images, animations among others. Through mapping attributes to visual properties
like position, size, shape and colour, visualization designers leverage perceptual skill in order to
help discern and interpret patterns within the data, providing a powerful method to make sense of
data [22].
2.3 Data Visualization in Open Science 17
Figure 2.9: Value of visualization scheme proposed by [50].
2.3.2 The Value of Data Visualization
Finding value within a visualization should be an integral part of the scientific process. Pro-
ducing good visualizations should not just be another step in the scientific process, but a step
towards clearing the way for analysis and re-usage of data. Much progress has been made in stan-
dardizing processes towards creating good visualizations, the advances in graphics hardware have
been great, and many new methods, techniques, and systems have been developed [50].
Some models like the one in Figure 2.9 display how Data D through a specification S is trans-
lated to image I(t). From the users’ perception P, knowledge K is obtained over time in different
degrees depending on the user’s original knowledge. Many visualizations, especially in recent
years, push towards incorporating interactivity adding a new value of exploration E, allowing for
the user to adapt the specification in order to facilitate the knowledge retrieval process. This data
can be from spreadsheet data to the text of novels, but much of it can be represented as variables,
arrays or transformed (perhaps with loss of information) into this form. Then we must be able
to transform this data into useful graphical visualizations that allow ultimately the user to extract
appropriate conclusions [7]. This is especially important when we talk about scientific data, since
its sensitivity to any distortion caused by the transformation from raw data to processable data
while paired with an unappropriated viewing method can lead to bias [23].
A valuable visualization has a high knowledge and is able to deliver information easily and
extensively to the user, so it’s quite clear how this can be appealing to scientists publishing their
works, analysts trying to interpret data, etc. In [6], major ways in which cognition can be amplified
by visualization are proposed. Even though this work is somewhat dated, some of these proposals
still carry relevancy when analysing them against the new trends brought by web notebook tech-
nologies. The need to search for information in data is largely cut if good visualizations are in
place. Visual representations are vital ways to enhance the search for patterns in data, encoding
information in manipulable mediums and using perceptual attention mechanisms for monitoring
data. This is usually associated to visualization solutions that are greatly amplified by web note-
book’s abilities.
18 Literature Review
2.3.3 Data Visualization in Open Science
In order to understand how visualization fits in the context of Open Science we must also un-
derstand how it leverages some of the best aspects of the notebook. Some existing work shares
a very positive outlook on how notebooks can contribute towards open science [38, 54, 37]. In-
specting data is one of the drivers for researchers to read publications and share their own, and
many funding agencies and journals alike require the release of data as a condition for funding or
publication [38]. This shared data is usually paired with visualizations in order to take advantage
of the aspects we have seen. These figures are however typically rendered as static images, and
divorced from the underlying data, preventing readers from exploring them fully using tools like
segmentation or zooming on features of interest [37].
2.3.3.1 Visualization Solutions for Open Science
Visualizations plays a role in several applications concerned with advancing towards Open
Science. From data management platforms to notebooks themselves all of these aggregate the
possibility to pair data with description. Visualization empowers users with ways to explore and
interact with data streamlining engagement with scientific research [28]. This brings us back to
web notebooks as a tool for open science, through their ability to produce a rich media output
combining narrative text, static images, interactive visualizations, and code running over data dy-
namically [54]. The inclusion of interactive visualizations can be used as an indicator of maturity
of a scientific research dataset according to the FAIR principles [53]. One of the proposed methods
for the evaluation of level of reproducibility of scientific research is [28]:
• Level 0 – Scientific research with no associated Data or data under non-Open Access licens-
ing;
• Level 1 – Research data in Open Data but no bundled transformation steps;
• Level 2 – Research data in Open Data and bundled transformation steps executable on-
demand;
• Level 3 – Research data with Open Access with transformations and interactive visualiza-
tions.
Being aware of visualization’s incentives, and how visualization plays a major role within the
notebook format we should get familiar with state of the art visualization solutions. In a compar-
ison based on GitHub metrics seen in Figure 2.10, it is compelling to evaluate which visualiza-
tion libraries are most popular among users, as a means to understand which technologies would
be relevant for the integration with this work. It makes sense to analyze some of these popular
libraries in a more in-depth manner to further assess which would be essential to this implemen-
tation. Vega [42] and D3 [11] are well established visualization solutions that various notebook
2.3 Data Visualization in Open Science 19
Figure 2.10: A popularity comparison between some of the most used libraries in Notebooks [28].
platforms18 19 have adopted either through direct integration or widgets20. Other very well estab-
lished technologies like ggplot 21 or Matplotlib 22 might not have as many advanced features as
the aforementioned but have gained the trust of their user base through their fundamental features.
2.3.3.2 Vega
Vega and most recently Vega-Lite [42] are high-level grammars that enable rapid specification
of interactive data visualizations, using traditional graphical grammar characteristics such as al-
gebra composition and visual encoding, while combining these with a new approach to grammar
interactions. Vega-Lite combines a traditional grammar of graphics, providing visual encoding
rules and a composition algebra for layered and multi-view displays, with a novel grammar of in-
teraction. Vega sets to enable rapid ways to specify interactive visualization and to do some with a
simple and concise systematic enumeration while pushing towards the exploration of design vari-
ation. Vega is capable of rendering bar charts, line and area charts, circular charts, dot and scatter
plots, distributions, geometric graphs, tree diagrams, network diagrams, while including custom
visual design tools and several techniques [51].
Either by implementing ways to visualize airport connections in the U.S. (Figure 2.11), depict-
ing character appearances in a novel (Figure 2.12) or PI calculations (Figure 2.13), Vega provides
uncomplicated and lightweight solutions. On the other hand, notebooks’ versatility allows for
tools like these to be used with relative ease by research scientists.
2.3.3.3 D3
D3 is a JavaScript library for manipulating data based documents, allowing to bind arbitrary
data to a Document Object Model, and applying transformations to the document [11]. It reduces
overhead computation allowing greater graphical complexity at high frame rates, transition se-
quencing through events and providing a wide array of drawing options. Being a web oriented
18https://jupyter.org/about19https://beta.observablehq.com/20https://pypi.org/project/ipywidgets/21http://ggplot.yhathq.com/22https://matplotlib.org/
20 Literature Review
Figure 2.11: An interactive visualization of connections among major U.S. airports in 2008 [51].
Figure 2.12: This example depicts character co-occurrences in Victor Hugo’s Les Misérables [51].
2.3 Data Visualization in Open Science 21
Figure 2.13: Estimation of PI by randomly sampling points and counting how many of them fallinside or outside a unit circle [51].
tool, it doesn’t replace the browser’s toolbox. D3 allows for instance to still use CSS3 transitions.
Due to its characteristics, D3 is an excellent option for web oriented notebooks solutions being
lightweight whilst very powerful.
D3 powers a very wide range of websites and is widely used in web notebook platforms such
as Observable23. In Figure 2.14 we can see D3 being used to map the package hierarchy for a
visualization tool, and in Figure 2.15 used to represent animated changes in vote shift in the U.S.
elections.
2.3.3.4 Matplotlib
Matplotlib is a library written in Python limited to 2D plotting. This particular aspect may
just be one of the main appeals to many of its users making plot generation, histograms, scatter
plots, among others, a streamlined process achievable with a small number of lines of code. An
interface similar to MATLAB’s and its ability to be used within IPython notebooks contributes
towards making users feel at home and surely its popularity [25].
2.3.3.5 ggplot
ggplot is an interesting example allowing for an overview on some recent trends in the field
of notebooks and furthermore visualization. ggplot2 is a declarative grammar for the creation of
graphics, allowing users to provide data and choose a set of variables that define the aesthetics and
the desired type of visualization. ggplot on the other hand is a plotting system for Python based on
ggplot2 that seeks to bring this workflow to Python users [21]. Considering ggplot2 is a R language
exclusive library, the effort to port its functionalities over to Python and how quickly many users
picked up on the new Python compatible library showcases a trend. Jupyter notebook’s rise in
23https://beta.observablehq.com
22 Literature Review
Figure 2.14: The Flare visualization toolkit package hierarchy and imports [24].
Figure 2.15: U.S. counties vote shift [11].
Figure 2.16: An example showing the streamlined nature of matplotlib’s visualizations [25].
2.3 Data Visualization in Open Science 23
popularity sparked the python community into making efforts towards converting functionalities of
other incompatible libraries. This happens also in the opposite direction, with python only libraries
being converted to other languages. However the influx is not equivalent, bringing forward a clear
trend where Jupyter Notebook’s popularity alongside the popularity of Python as a programming
language make it a very preferable option when considering continuous support [28].
2.3.3.6 nbviewer
nbviewer is a web application focused on rendering notebooks as static HTML web pages.
This aims at providing users with a stable representation of the notebook that can easily be exam-
ined and shared with others. nbviewer is written in Python and JavaScript and uses nbconvert to
render the notebooks. Its an open source project much like Jupyter Notebook and affiliates.
nbviewer does not execute notebooks, it only renders the inputs and outputs saved in a note-
book document as a web page 24.
These tools provide notebooks with options for creating visualizations in a simplified way
with a wide array of features. They promote interactivity with the data at hand and extend the core
values the notebook has to offer. The incentive to look, interact and reiterate on scientific data
promotes the fundamentals of Open Data and increases reproducibility.
After this evaluation it became clear that supporting a wide array of libraries and providing
users with flexibility when deciding which tools to use in their work is crucial for the success of a
platform, so we took this into account when proceeding to the implementation.
24github.com/jupyter/nbviewer/blob/master/nbviewer/templates/faq.md
24 Literature Review
Chapter 3
Methodological Approach
When building a systematic and cohesive methodology to approach the implementation of
this dissertation’s work, we looked to partition the work into phases that would guide this effort.
Through this segmentation procedure we split the methodological approach into two main phases,
requirement gathering and implementation. The first phase is performed in order to carefully com-
pile a set of well defined requirements that should be fulfilled for this dissertation to be considered
a successful effort. The second phase will consist in the implementation of the specified require-
ments. In this chapter we will review the methodology used in this dissertation while thoroughly
examining all the carried out work.
3.1 Problem Formulation
In order to begin staging the implementation we first must look to clearly define what is the
issue that raises the necessity of this endeavor. After the contextualization done in the review of
the state of the art over the different aspects of open science, visualization and web notebooks we
can formulate a problem with far better insight.
Research work comprises more than publishing an article, paper or some of the other tradi-
tional outcomes. In order to support the validity of research work research often adopts comple-
mentary goals such as description and sharing of data.
Jupyter with its web platform allowing users to build shareable notebooks was a big step
forward for accessibility, allowing for sharing and reusing processing through the employment of
interactive visualization, text descriptions and code execution.
To bring this sort of workflow to “Dendro” platform, whose functionalities are similar to a
“private dropbox” oriented to scientific work groups that allow users to deposit and manage their
files. Adding the ability to annotate this data, manipulate and interact with it, would help bringing
the versatility needed to make Dendro a state of the art scientific data management platform.
25
26 Methodological Approach
3.2 Use Cases
To consolidate the problem formulation process we established a series of use cases as a way
to translate this dissertation’s goals into concrete coding objectives and to streamline the imple-
mentation of essential features. These are focused on the user interactions with the notebooks
within the platform, from creating to editing, downloading and searching for notebooks. The IBM
use-case template 1 was used as a guide for the use-case tables. These use cases, which are com-
prised by actors, pre-conditions and flow descriptions, are going to be user centered and assume
that users are registered and logged in. Finally we define an actor as anyone or anything perform-
ing a behaviour within the system, a pre-condition as an action that must happen before the case
runs, and basic flow the steps within an action in case nothing goes wrong [49].
Table 3.1: Use case table for the creation of a Notebook
Use Case 1 Creating a NotebookActor UserBasic Flow The user selects the option to create a new Notebook on the toolboxPrecondition User is currently in a respective project folder
Table 3.2: Use case table for launching a Notebook
Use Case 2 Launching a created NotebookActor UserBasic Flow The user launches the selected notebook from the respective projectPrecondition A Notebook must be previously created at the current user’s project
Table 3.3: Use case table for the execution of a Notebook
Use Case 3 Executing a NotebookActor UserBasic Flow The user runs the selected notebook executing its processing codePrecondition A Notebook must have been previously created
Table 3.4: Use case table for the download of a Notebook
Use Case 4 Downloading NotebooksActor UserBasic Flow The user presses the download button at the sidebar of the previewPrecondition User selects Notebook file
1https://www.ibm.com/support/knowledgecenter
3.3 Selecting Jupyter Notebook 27
Table 3.5: Use case table for the editing of metadata in a Notebook
Use Case 5 Editing Notebook metadataActor UserBasic Flow The user presses the edit button at the sidebar of the previewPrecondition User selects notebook file
Table 3.6: Use case table for the previewing of a Notebook
Use Case 6 Previewing a NotebookActor UserBasic Flow The user previews the Notebook.Precondition 1 The user selects a notebook file.Precondition 2 The notebook file contains code or visualizations over data
Table 3.7: Use case table for the search of a Notebook
Use Case 7 Searching for a NotebookActor UserBasic Flow The user inputs the keywords for the desired NotebookPrecondition 1 The Notebook files have been created previouslyPrecondition 2 The Notebook file is in a public repository
After defining these use cases the next step was to consolidate the technological basis and
understand how these newly added features would interact with Dendro, where it would be ap-
propriate to integrate this newly created structure, and how they would fit into the established
workflow.
3.3 Selecting Jupyter Notebook
Considering the previously defined use cases we can already start to see how the implemen-
tation will be divided into two components, the visualization of web notebooks outside of the
notebook environment and the integration of the notebook environment. One intersecting aspect
will be the selection of notebook technology to be implemented with Dendro. From the research
conducted in the state of the art, furthered in a scientific paper written in the scope of this disser-
tation, we concluded that Jupyter Notebook is currently the best alternative for the average user,
given its popularity and support, combined with broad support for powerful and high-level interac-
tive visualizations [28]. The decision to integrate more than one notebook solution could possibly
improve the appeal of the platform, for instance RStudio sees a large amount of popularity and is
not inherently compatible with Jupyter. However it is quickly expanding and becoming a norm
28 Methodological Approach
which other notebook technologies try to follow [28]. Integrating Jupyter Notebook technologies
with Dendro is the most feasible and flexible approach to this dissertation, combining flexibility
for the users as seen in table 2.1 and clear implementation procedure without requiring a major
change in the platform’s structure.
3.4 Jupyter Notebook
To understand how this implementation will be accommodated within the established structure
in Dendro we look to analyse how the two components of this implementation will integrate. In
order to do this we must first understand the structure and work flow of Dendro. To understand
where to correctly integrate Jupyter with Dendro we ought to focus on its main assignment, that is
to process and to catalog research data. The established system revolves largely around concentrat-
ing users in different research projects that function as data hubs, which is where the interactions
users have with data mostly take place. It makes sense to embed notebooks within projects, but
one of the main disadvantages would be that each notebook would be isolated and associated to
a project. However creating this association between the notebooks and projects will make the
notebooks more closely connected to each project and more easily shared between members of
this same project. This will allow an easier access to the research data by the notebook making it
easier to catalogue and document this data with processing code or visualizations.
3.5 Notebook Viewer
When considering where it would be most appropriate to integrate notebook visualization, the
inclusion within the projects component of Dendro seemed the most logical choice.
Similarly to the assets geared towards the creation and execution of notebooks discussed
above, it made sense to consolidate all of the features in one place. Since file creation and upload
features in Dendro converge around the project structure, the most reasonable approach would
be to implement the ability to create and save notebooks associated to a project. In this manner
users would be able to create a project, run a notebook environment where they would be able
to access such project data and have all the usual notebook execution capabilities. In the already
existing project view, an element was added in the navigation list where the notebook visualiza-
tion is displayed. Initially this display was designed as a rendering tool only capable of rendering
the notebooks created within the project. However, the idea of being able to render python note-
books was an achievable concept and thus the scope of the implementation changed in order to
accommodate notebook rendering.
3.6 Iterative Implementation
One final step before proceeding with the implementation was to create an implementation
plan that could accommodate direct feedback from possible users as a way to shape the develop-
3.7 Implementation Strategy 29
ment. Providing an appropriate response to the requirements of the established user base of the
platform was a major focus during this implementation. To understand how impactful the pro-
posed solution will be, a series of interviews with researchers were planned. The intention was
to find an expert user, whose experience with Jupyter Notebook was significant and could give
insight on the typical workflow and expectations a platform such as Dendro should fulfill.
The purpose of these interviews, besides assessing expectations, was also to guide the im-
plementation in a iterative way that would meet the expectations of users through a series of ad-
justments over time resulting from feedback gathered in these interviews. Initial interviews were
intended to identify the desired features and perform adjustments. Later interviews, besides updat-
ing our expert user with the development, would also help evaluate the effectiveness and overall
impact of the implementation on their initial vision and workflow.
In order to conduct these interviews it was necessary to carefully select our expert user, create
an interview script and extract appropriate conclusions.
3.6.1 Expert User Selection
When selecting a candidate for this process the focus was on researchers familiarized with
both Dendro and Notebook technologies. However it became evident early on that finding such
a profile would be unlikely. The still limited scope of Dendro in comparison with other scientific
data management platforms alongside made it harder to fit these requirements. The decision was
made to prioritize experience with Notebooks over familiarity with Dendro. Search began within
research teams in close proximity with this dissertation’s work group, collaborators of INESC
TEC and fellow colleagues. This is where we first came across a researcher currently using Jupyter
notebooks for research work comprising large satellite data analysis. The search for other subjects
did continue for a period of time, but due to the difficulty in matching the defined criteria, an
interview plan was setup and contact with our already established expert user was established.
3.6.2 Interview Structure
The structure under which these interviews would occur was organized as a three phase pro-
cess. First, a series of emails were sent to establish the level of familiarity of the subjects with
Dendro’s workflow and Jupyter Notebooks. Followed by a personal interview in order to better
interpret the needs and expectations of users of these technologies. And lastly, a final meeting in
order to obtain some evaluation metrics on how the obtained solution would affect the established
workflow, possible improvements and the level of satisfaction with the implementation.
3.7 Implementation Strategy
After gathering and considering these requirements we could have a very rough view on how
the implementation would pan out. It quickly became clear that the work would be focused on the
implementation of two key components.
30 Methodological Approach
These would provide the basic structures for the integration with Dendro. At its core we looked
to implement a structure that would allow users the ability to execute instances of Jupyter Note-
book capable of processing, managing data and creating visualizations or annotations within each
repository. And to complement this we also have to add another structure capable of displaying
notebooks annotated with visualizations ideally through commonly used libraries like nbviewer2.
The implementation strategy was to begin the work with the integration of Jupyter Notebooks
followed by the implementation of a structure to render notebooks. In this section we will go into
detail over the two main implementation steps that would allow the fulfilment of the projected
use cases — integration of Jupyter Notebooks and ability to render notebooks in the “Dendro
platform”. We will also discuss how the interviews with our expert user influenced this implemen-
tation in both key aspects.
3.8 Expert User Interview
In this section, and before getting into the description of the implementation, we will explore
in which way the interview with the chosen subject influenced the implementation through an
analysis of the interview itself and the outcomes taken from this effort.
3.8.1 Interview with Expert User
After making contact through initial emails it was settled that the selected researcher was us-
ing Jupyter Notebook to capture process of handling large chunks of data, attempting to enhance
reproducibility. In order to conduct the interview in an efficient manner and extract useful deduc-
tions towards the implementation, it was very important to define clear topics of discussion and
objectives.
Therefore, according to the obtained knowledge through the initial exchange, discussion guide-
lines for this interview were centered around three main concerns. The routine workflow, associ-
ated systematic problems and personal suggestions for improvements by the interviewee. Note-
books were being used as a gateway to operate over data while preserving the methodology for
re-utilization. Since most users will look to take advantage of the most basic features of Jupyter
Notebooks, this can be seen as a very common usage. When inquiring about the usage of note-
book’s visualization potential the interviewee mentioned only the annotation of notebooks with
simple visualizations provided by the nbviewer library.
Considering these use cases, the researcher was then asked about the difficulties and limitations
within their workflow. Issues became apparent when analysing the data structure with which this
researcher was working. Firstly the notebook was being used only to interact with data and storing
the handling process and had no interactions with the Dendro file structure. This led to a tedious
process which required extensive lookup in an vast repository with no control over its file structure.
Some performance issues were also reported, in part due to the dimensions of the manipulated
2https://github.com/jupyter/nbviewer
3.8 Expert User Interview 31
files, the notebook provider’s platform limitations and the fact that segmentation of the files was
not possible.
Finally the way the current platform was set up didn’t allow the researcher to share her work
in a way that could make possible for it to be expanded upon by other researchers, which was in
large part because it wasn’t aimed at reproducibility from the start. Due to concerns with privacy
and the integrity of the notebooks and with no current solution, aspects of reproducibility were
greatly disregarded.
Lastly the conversation went over some suggestions on how to improve some aspects of the
designed implementation with a focus on employing visualization. These suggestions in combi-
nation with an analysis of the conversation lead to a series of deductions on how to better adapt
the work in development as expected.
3.8.2 Extracted Deductions
The conclusions taken from this interview revolved around three main subjects: visualization
requirements, performance evaluation and Jupyter use cases.
Concerning visualization, it was emphasized that methods regarding its use were field specific
and could range from crucial to redundant. Users in some fields may find it strictly necessary
to accompany their annotations and work with visualizations ranging from the simple to the very
complex, while other users may not find it necessary at all. In the particular case of the interviewed
researcher, visualizations were used to explore data but visualization support from libraries such as
D3.js3 weren’t strictly necessary. What we took from this was that a convoluted solution wouldn’t
be ideal. While having the option for deploying visualization libraries in a modular fashion would
be ideal, it should be considered that a big portion of users might not need these features by default.
Regarding performance, this researcher in particular considered it to be a big case for concern.
While concentrating notebook services in online platforms it is important to think of performance
as key since daily users of the platform will have to deal with these aspects for extended periods of
time. This subject in particular has faced issues with execution times in notebooks and stuttering
issues. The visualization previews being used at the time were causing the researcher to loose
sometimes up to hours trying to work around these issues. Having already experimented with
different web browsers, libraries and methods, the researcher felt stranded due to the immutable
structure of the current platform.
Ultimately, in this particular case, Jupyter was mostly being used as a way to process data for
analysis and as a tool for the research document processes for personal use, often not having the
detail required for others to interpret them independently. As previously mentioned the use cases
for the implementation will be varied and using the notebook as just a showcase for scientific work
won’t be enough to fulfill the needs of the various users. It will be important to also bridge the gap
between the notebook as a tool for scientific research while keeping the efficiency of its all-around
functionalities.
3https://d3js.org/
32 Methodological Approach
3.9 Jupyter Notebook Integration
When starting the implementation of this notebook solution one of the first steps was to study
where the integration of a notebook into “Dendro platform” would best be suited. When analysing
the structure of the platform it became clear that centering features around the project structure
would be the most elegant solution. Furthermore, being aware of the typical workflow is also in-
dispensable, as the interaction users maintain with the platform mustn’t suffer significant changes
with the inclusion of a notebook service.
The integration of Jupyter Notebooks in this solution can be seen as a three step process build-
ing up to a fully implemented notebook service. Firstly, the usage of a docker image used to launch
the notebook to be distributed by the users through a reverse proxy structure. Then, adapting Den-
dro to work as a reverse proxy between the user and the running Jupyter images. And finally, a
synchronization process keeping consistency between the project file structure in Dendro and the
notebook’s file structure.
3.9.1 Jupyter Notebook Docker Structure
The first step was to chose a Jupyter Notebook image that would suit our needs. We looked to
integrate the most stable and continuously supported image, so the chosen docker image was the
popular and frequently updated scipy-notebook 4.
In order to understand the implementation we have to have a basic understanding of docker [16]
and also understand the logic behind docker deployment within Dendro.
Docker enables the separation of applications from the core infrastructure of Dendro. This
way software can be packaged and run in a isolated environment, and this environment is called
a container. This isolation allows many containers to be run simultaneously on a given host.
These lightweight containers run directly within the host machine’s kernel. Each of the running
notebooks will be a container of the scipy-notebook running in its own individual instance. These
images can be started, stopped, move or deleted [15].
Before proceeding, it is imperative to make some distinctions between some core elements in
this work’s infrastructure. Images, containers, networks, volumes and orchestras are going to be
discussed throughout this section and their definition should be explored deeper.
A Docker Image is essentially a file composed by layers representing instructions for the
image that is used to execute code in a Docker container. It is built from the instructions necessary
for a complete and executable version of an application. One image can be ran by Docker multiple
times creating multiple instances of the respective container [27].
A Docker Container distinguishes itself from the image mostly through its writable top layer.
This layer is capable of storing all the modifications over existing data within the container. This
layer is unique to each container and leaves the underlying image unchanged [27]. This is advan-
tageous for our implementation since it allows for every container to be based of the same image
while still having unique data state.
4https://hub.docker.com/r/jupyter/scipy-notebook
3.9 Jupyter Notebook Integration 33
Figure 3.1: Implementation Scheme Representing Docker Structure Within the Integration
A Docker Volume is a directory in the Docker host’s file system that is accordingly mounted
directly into a container. These volumes are not controlled by the storage drive since they directly
bypass it and operate at native host speeds. Multiple volumes can be mounted into one container
and on the other hand can also be shared. In this implementation each notebook container is
assigned one and only one unique volume [27].
A Docker Network is created by default in any docker host, while then automatically adding
containers to it. This creates a subnet that allows containers to run within a network, and these
containers can run on an isolated network (for instance for security reasons) or can run within the
same network allowing the containers within the network to communicate with one another [15].
Finally an Orchestra can be defined as a series of micro-services, which in this case are the
different containers running within each of the existing orchestrations, with a common goal. They
work in a logical way to provide a service bigger than each of the individual micro-services [36].
To understand how these aspects come together in this implementation we can look at Fig-
ure 3.1, where we can see the images, containers, volumes, networks and orchestras implied in
this implementation. The images “scipy-notebook” and “Traefik” were added in this implemen-
tation, while the images “Virtuoso” and “mySQL” are representing some of the already existing
images in Dendro. The gray blocks represent the orchestras.
34 Methodological Approach
Firstly we should consider the existing infrastructure. In Figure 3.1 we see how containers are
created based on the images and are orchestrated into the “Dendro Orchestra”. Since no network
was specified in the respective docker-compose file all of these containers are added to the default
network, this can be seen in 3.
To implement the notebook service a new separate infrastructure was implemented. Since
Dendro is responsible for orchestrating both of these structures, the reason why separated orches-
tras and networks were created was based on two aspects. These infrastructures were fundamen-
tally providing a different service. We look to orchestrate the notebook services in a way that
allows to start, stop, delete and deliver distinct notebooks amongst the users. There is also a ne-
cessity created by the routing requisites of the implementation for the different notebooks and the
routing entity to communicate with each other. This led to the creating of a separate network to
accommodate the notebooks.
Referring back to Figure 3.1, in 2 we identify the newly created notebook network, any con-
tainer created based on the “scipy-notebook” image is added to this network alongside the proxy
micro-service “Traefik”. This will allow for these containers to communicate with each other,
which in this case means the proxy service will know about the existence of any running Jupyter
notebook container and will therefore be findable by the proxy.
In 1 we can see each of the notebook orchestras, linked to the notebook network and respec-
tively containing a volume for the data within the notebook. These notebook orchestras will only
be launched by Dendro if a notebook is created or activated, and can go up to any number of
instances. The volume, mounted on the host’s file system, will contain a data folder where the
corresponding data stored in the notebook container will be maintained.
Before proceeding with an overview of the behavior of the proxy micro-service there is one
final aspect, within this infrastructure, we must analyse. In Figure 3.2 we have a scheme for the
naming structure of the containers and volumes within the system.
Each container is assigned a randomly generated ID upon creation, the corresponding volume
on the host system is then mounted within a folder called “jupyter-notebooks”. This folder con-
tains various volume folders that will share their name with the ID of the respective container as
represented in Figure 3.2. These IDs are stored in Dendro at the time of creation for each note-
book, meaning each notebook shares and identical volume name, container name and object name.
This will be a fundamental aspect of the synchronization and redirection aspects of the notebook
infrastructure.
3.9.2 Reverse Proxy Implementation
The need to correctly redirect the users requests regarding notebooks to the correct notebook
container led to the implementation of a system that allows Dendro to operate mostly as a proxy for
these requests. To make sure every user was correctly linked to the right container and to maintain
access permissions across the platform, Dendro is required to play the role of a mediator of all
requests. Therefore a reverse proxy architecture was implemented. These sort of architectures are
3.9 Jupyter Notebook Integration 35
Figure 3.2: Container and Volume naming structure
usually utilized in order to achieve high availability and distribution and because of this it seemed
the ideal solution for this issue.
To accomplish this, the networking service Traefik was utilized in order to simplify some
of the complexity associated with creating this sort of networks. Instead of having to configure
each route connecting to different paths or sub domains for each container, Traefik is used to
automatically generate the routes for the containers connecting them in a simplified manner [46].
As previously discussed both Traefik and the notebook containers share the same network,
which is the first requisite for the normal behaviour of Traefik. Secondly, the names of the con-
tainers are consistent as seen in Figure 3.2, which means we know internally the value of each
notebook’s ID, that furthermore coincides with the container name satisfying Traefik’s second
requisite. Being able to communicate with each of the containers, by sharing their network, and
correctly addressing each notebook by its container ID, we have assembled the conditions for
Traefik to correctly generate routes for the requests. Now to explore the architecture based on
these features we can look at Figure 3.3.
Figure 3.3 at its core represents a request and response scenario for this new architecture. A
request is made by the user regarding any existing space within the notebook. The location of the
resource within the notebook is represented in the URL in gray, while the representative GUID
(unique designation for each container) is represented in red. When receiving this request Dendro
internally modifies it as seen in 1, the address is modified in order to be processed by Traefik, and
36 Methodological Approach
Figure 3.3: Reverse Proxy Diagram Structure
the “Host” is altered in accordance to the GUID for the target notebook. With these two steps
Dendro “stamps” the request with the respective “Host” information changing it from the original
“Host” of the request (that would be Dendro) to the “Host” GUID value corresponding to the target
container running in the server. This modifications allow for the request to be redirected towards
Traefik with this information, which in its turn will find the desired container based on the newly
modified and respective “Host”.
Furthermore, we can see in 2 the target resource in gray is not altered by this process. Traefik
will parse the request to the specified container, without changing the URL allowing to address
the correct resource.
From the perspective of the user, however, this process is seamless as seen in 3: the URL
remains unchanged and the user is unaware of the proxy process.
This allows for the same container (Jupyter notebook) to be accessed by any user that attempts
to connect to it correctly. This means that any user that has permission to access the container
(either by being a collaborator of the project, its creator or in the case of a public notebook), can
have simultaneous access to the same instance of the executing container.
3.9.3 Synchronization and Data Management
With users using the notebook service to successfully access running Jupyter Notebook con-
tainers there is one fundamental aspect of the notebook service for consideration. Managing the
data within the notebook, ensuring its availability to all users and maintaining the integrity within
the different file systems. Each container encompasses its own file system, different from the one
in storage, therefore this structure must be replicated in order to share the same virtues as other
data objects in Dendro such as the ability to add metadata.
Users should be able to start a notebook and modify its file structure either by creating files and
folders or uploading their own. The opposite should also be valid, allowing for existing notebooks’
data to be persistent over time and consistent with the data existing in the repository.
3.9 Jupyter Notebook Integration 37
There are three main systems in place in order to bring this structure together. NotebookCreation which is the process evoked when a user creates a new Jupyter notebook within the
platform. Notebook Restore can be evoked by an user attempting to access a running notebook
container or trying to launch an existing one — it can be most easily summarized as an attempt
to access a notebook that has been previously created. Finally the last system is the NotebookMonitor Job, a periodic process that looks for changes within the existing notebook volumes and
is responsible for updating the respective notebook objects in storage with the incoming data.
When creating a notebook, the Jupyter Notebook’s volume (its internal file system) is defined
by a parameter set on container generation, which will be the root location for any file or folder
created within that notebook. Each notebook behaves much like a folder containing all of the data
belonging to the notebook, including ipynb files themselves. The concept of notebook within
the platform won’t represent the traditional notebook ipynb executable file but will instead work
much as an object containing all of the notebook’s structure files and data. Making each notebook
a fully independent and isolated entity that can be shared and exported.
When restoring a notebook there is a series of checks that must be performed, which are
centered around two main components necessary for the correct execution of the notebook service.
These are the current state of the docker container (running or stopped), and the existence of the
volume folder in the host machine. This process will then be in many ways similar to the creation
of a notebook for the first time, and we can overview this in full effect in Figure 3.4.
Before analyzing Figure 3.4 in depth, the decision states for which an user can restore a note-
book should be clarified. There can be four possible scenarios related to the container and volume
state:
• The most straightforward scenario would be a running notebook container and an exist-
ing volume within the host machine, this means a Jupyter notebook is active and the user
accessing it just needs to be correctly redirected.
• The second scenario would be a stopped notebook container while the data volume exists in
the host machine folder, this can be caused by a crash in the container, for instance, and the
container is then restarted.
• Another scenario would be an active notebook container but no data volume in the host
machine, this could be caused by an error or a platform crash when restoring the Jupyter
notebook object data, the volume should be restored and the container restarted.
• Finally, the last scenario would be the default scenario for an existing notebook in the plat-
form, the container is stopped and the volume data is not mounted on the host machine.
In Figure 3.4 the processing flow of the create and restore operations can be seen. After the
user selects the option to create a new notebook, the current folder within the project structure
will become this notebook’s parent folder and the process of starting the orchestra will begin.
Concurrently, a volume is mounted on the local system and the notebook’s ID and details are saved
38 Methodological Approach
Figure 3.4: Flow Chart of Notebook Creation and Restoration Systems
in Dendro storage; once these operations are complete, Dendro will access if the container is ready
and will redirect the user to the Jupyter Notebook splash page. When choosing to activate/restore
an existing notebook, the process will be somewhat similar, with the addition of the corresponding
checks for the status of the container and volume. After these checks, the process will be merged
at the respective phase as seen in Figure 3.4. The processes in the gray boxes are exclusive to the
restore functions.
The final aspect of this system is the monitoring job process. Jobs are a series of processes
that are executed by the server running some process independent from the user. Using cron
syntax to define the interval of time to which the “Notebook Monitor Job” would run, which is
currently set to run every five minutes, this job is then responsible for comparing the value of
the last modification within each of the running notebooks with the last modification value in the
repository for the respective notebook. It will then proceed to upload the data from the notebook
to the repository accordingly.
In Figure 3.5 we notice the workflow of the job, launching on Dendro startup, which asyn-
chronously performs the checks for the modification values and then performs the comparison
for each of the notebooks. As, previously described, if the modifications in the volume are more
recent than the ones in the repository the data is uploaded. However if the modifications in the
volume are not more recent than the ones stored in Dendro this means that the modification values
are the same and the current content is already up to date, which means we are not free to shut
3.10 Jupyter Notebook Viewer 39
Figure 3.5: Notebook Monitor Job schema
the container down and clean up the volume folder as seen in Figure 3.5. This means that each
notebook has a current timeout value of five minutes and after that the container will be shutdown
and will have to be restarted. This guarantees that the server is not overloaded with an excessive
number of notebook containers running simultaneously and data is uploaded to the repository at
reasonable rate. These values can modified and adjusted accordingly.
3.10 Jupyter Notebook Viewer
In order to convey more versatility to this web notebook service within the platform, it is
fundamental to have the ability to quickly view notebooks. These previews should be available
to the notebook creators, the work group and or publicly if the owner decides so. The central
focus when deciding the aspects of this visualization method was which technology to deploy.
The rendering of the notebooks on these pages seemed only logical to be done with nbviewer5, as
it is one of the preferred technologies for fast notebook rendering.
As seen in our structure model, the most logical way to integrate notebook previewing should
be within the context of the project allowing for users to preview a selected a notebook by explor-
ing a notebook folder.
5https://github.com/jupyter/nbviewer
40 Methodological Approach
Even though previously studied in the state of the art analysis, it is once again interesting
to scrutinize the way nbviewer was implemented in this solution since it uses a different logic
from the original nbviewer. The current version of nbviewer6 provides rendering functionality to
a series of notebook technologies, when provided a repository link containing a notebook. The
local running version of nbviewer provided by Jupyter adds more features and higher versatility
since its functionalities can easily be extended. The implemented version is based on a client side
rendering version7 of nbviewer. The reason why the work was based on this implementation was
because of its lightweight nature and possibility to render a notebook without any running instance
of Jupyter Notebook.
This action was implemented in order to increase accessibility and reduce time spent in order
to load different notebooks. The rendering process of the notebooks is very fast, since it is fully
written in JavaScript while deploying marked.js8 for markdown rendering and Prism.js9 for syntax
highlighting.
Finally, when implementing this rendering feature, the initial scope was to create previews for
notebooks created within the project, however this feature also extends as a standalone notebook
visualizer. This will allow researchers within the platform to more easily explore other notebooks
and bring benefits for users using Dendro as a tool in their daily workflow, significantly increasing
this implementation’s appeal to a diverse set of Jupyter notebook users.
3.11 Difficulties and Obstacles
When implementing the features in this dissertation, some aspects proved to be significant
obstacles. The first and most prominent was the concurrent computation required to operate a
large number of notebook containers within Dendro, which was mitigated by usage of docker
images.
However, this obstacle returned when discussing the synchronization of the notebook file sys-
tems and the data in the repository. The initial planned structure involved a number of watchers
over each notebook file system responsible for notifying Dendro of changes within the notebook,
but it became apparent that such a dynamic would cause a severe overload of the database several
notebooks running within the platform.
This structure ended up being abandoned and replaced by a job that itself raised some issues
at times because its periodicity could cause some data to be lost, and therefore some attention was
required and several parameters had to be added to the notebook structure in the database in order
to guarantee that it maintains its integrity.
6https://nbviewer.jupyter.org/7https://github.com/kokes/nbviewer.js8https://github.com/markedjs/marked9https://prismjs.com/
3.12 Digressions from the initial implementation 41
3.12 Digressions from the initial implementation
From the described use cases that set the goals for this initial implementation plan, a standout
feature that did not come to fruition was the ability to search for notebooks within the platform.
Since the structure chosen for the notebooks and the way they integrate with Dendro requires the
notebook to be contained within a project and retain some structural integrity because of its asso-
ciated data, it became complicated to implement a searching system that would be exclusive for
notebooks. Users however still have the possibility to share these notebooks within the platform,
either directly with their contributors on the project, or by sharing the notebook link guarantee-
ing that the notebook is contained within a public project. And even though a specific notebook
search functionality has not been implemented, it is important to mention that due to the notebook
contents synchronization with Dendro’s file structure. This allows the contents of the notebook to
be annotated and described having their metadata indexed by Dendro.
42 Methodological Approach
Chapter 4
Outcomes and Experiments
In this chapter we describe the outcomes of the implementation and the experimental process.
We will showcase the core implemented features individually. We will also document the exper-
imental process and reflect upon the obtained outcomes extracting considerations relative to the
success of the implemented work.
4.1 Implementation Outcomes
Firstly we provide a description of the implemented features alongside a showcase of the inter-
actions users can have with the dissertation work. We will match and compare the accomplished
features of this work with the initially outlined aspects. So that later we can also evaluate their
implementation through an experimental process.
We divide the implementation outcomes in two main components, the integration of the Jupyter
Notebook service and the notebook viewer component.
4.1.1 Jupyter Notebook Integration
In this section we explore the outcomes of this implementation from the standpoint of the
user. Based on our use cases, users should be able to create, launch, execute, download and share
their notebooks. Since the notebook object structure inherits a lot of properties from existing
elements in Dendro, such as Folders, qualities like sharing and downloading are accessible under
the previously established methodologies. However, actions like creating a notebook, starting or
restoring a notebook and executing actions within the notebook required a totally new interface
and batch of methods.
Firstly we begin by looking at how users approach the creating a notebook action, in Fig-
ure 4.1. The user navigates to any existing project, and after creating a folder the user can select
it by left clicking on it. If the folder is selected the user can select the “Create Notebook” option
from the drop down menu or it can bypass this by simply left clicking on the folder and equally
selecting the “Create Notebook” option. This can be done for any folder except the root folder
43
44 Outcomes and Experiments
of the project in order to maintain the consistency of the repository between projects that con-
tain notebooks or not. Also, a notebook object must always be created within a folder. The user
interface for the described action can be seen in detail in Figure 4.1 as previously mentioned.
When a user looks to start a previously created notebook, the action is very similar to the
one described for the creation of the notebook. However, when selecting the drop down menu or
right clicking the notebook object like previously, the prompt will be different, giving users the
options to “Start Notebook” instead, as shown in 4.2.
After both scenarios of creating or starting an existing notebook, the users will be redirected
to Jupyter notebook object splash page where they have a series of controls provided by the
Jupyter container, with options to upload files, create new folders, create new files, among others.
In Figure 4.3 we can see an example of this, where two files were uploaded, but these files could
also have been created by the user within the Jupyter interface. The ipynb file seen in the figure
is the typically recognized Jupyter notebook file, that in this example explores the “Lorenz system
of differential equations”.
Furthermore, we can see how the active kernel of the notebook is capable of executing thenotebook in real time, alongside any imported libraries to the notebook object, allowing it to
render in real-time a representation of these differential equations which the user can explore
and execute at will, as seen in Figure 4.4. This demonstrates the capacities the notebook has to
showcase research data in real time and perform methods of data processing within the platform.
This completes the showcase of possibilities the Jupyter Notebook framework makes available
to users. However with Dendro we can dig further into the platform possibilities since through the
navigation tools in Dendro we can explore the files within the notebook, as seen in Figure 4.5 and
combine the notebook file system with traditional metadata records. In Figure 4.5 we can see the
existing files within the previously created notebook, and perform any of the platform established
functionalities on them.
Since each notebook works as an isolated system, the possibilities for the libraries and func-
tions each notebook can perform are limitless and allow users to customize their notebook to their
needs. One of the advantages of having a centralized notebook service system serving containers
to different users is that any dependencies or imports necessary for the correct performance of the
notebook won’t have to be replicated by other researchers interested in exploring that notebook’s
work. Since each notebook and its associated structure is saved and replicated (imperceptibly to
the user) on notebook start. This brings one of the biggest advantages of this system by allowing
the fast and simplified reproduction of methods.
4.1.2 Notebook Viewer
Finally, on the notebook viewer section of the implemented work, the goal was to integrate a
way of previewing notebooks within the project structure. As it can be seen in Figure 4.6, a section
within the central panel in the project explorer menu called “Notebook View” allows for the quick
preview structure to generate a static version of the notebook, allowing users to understand if such
notebook is of interest.
4.1 Implementation Outcomes 45
Figure 4.1: Notebook Creation Flow Chart.
46 Outcomes and Experiments
Figure 4.2: Notebook Start/Restore Flow Chart.
4.1 Implementation Outcomes 47
Figure 4.3: Active Notebook Interface.
48 Outcomes and Experiments
Figure 4.4: Interaction With Active Notebook Execution.
4.2 Experimental Outcomes 49
Figure 4.5: Using Dendro to Explore the Notebook File System.
This section also allows for external notebooks to be imported and rendered client-side by
a drag and drop functionality that allows users to simply drop their ipynb files on the viewer
section that will proceed to render the notebook. This lightweight method allows more flexibility
for researchers, in case they need to manage several notebooks and need a quick way to preview
them without uploading them to the platform, and an example of this can be seen in Figure 4.7
4.2 Experimental Outcomes
In this section we will be analyzing the methodology and results obtained from the experi-
mental process targeting the developed work. In order to obtain insightful conclusions on both
the usability and relevance of the implemented features we looked to perform experiments that
would provide results on both the system usability and the usefulness of the implemented meth-
ods. To that end, the experimental procedure focused mostly on establishing a usability test plan
that would guide participants through a series of tasks concluding in a system usability survey and
an interview in order to assess the subjects’ satisfaction with the work implemented. We will now
analyze this procedure from initial test planning and subject selection, experiment execution and
finally assessing the outcomes.
4.2.1 Usability Test Plan
This usability test plan was elaborated largely based on the proposed approach by Usabil-
ity.gov, which served as a guideline for this section [5, 44]. Usability.gov is the leading resource for
user experience and best practice guidelines, provided by the United States Department of Health
and Human Services, this website overviews user centered design process and covers methodolo-
gies and tools in order to make digital content more useful and usable [44].
50 Outcomes and Experiments
Figure 4.6: Notebook Viewer Static View.
Figure 4.7: Notebook Viewer Rendering Imported Notebook View.
4.2 Experimental Outcomes 51
We carried out the usability test in order to establish a baseline of user performance, validate
user performance measures and identifying potential design flaws to be addressed in order to
improve the efficiency, productivity and end-user satisfaction [48].
In order to establish a baseline for user performance and satisfaction levels, a series of tasks
to be performed by representative testing subjects was envisioned. These tasks would help de-
termine design inconsistencies and usability problems within the user interface and content areas.
Throughout these tasks these problems are categorized into critical and non-critical errors. Critical
errors are defined as a deviation at the completion level from the targets of the scenario, where the
users may or may not be aware that the task goal is incorrect or incomplete [48]. Non-critical
errors are divided into three potential types of errors as categorized by [48]:
• Navigation errors – failure to locate functions, excessive keystrokes to complete a function,
failure to follow recommended screen flow.
• Presentation errors – failure to locate and properly act upon desired information in screens,
selection errors due to labeling ambiguities.
• Control usage problems – improper toolbar or entry field usage.
The data collected from these tasks was used to assess whether the implemented work achieves
an efficient and effective user interface.
When looking to select a group of users to perform the evaluation tasks aimed at assessing
usability, we looked to choose at least five participants. The reason from this number comes from
the assumption that elaborate usability tests often are associated with a waste of resources. The
best results come from testing with no more than five users alongside the execution of as many
small testing sessions as possible [33]. To perform these testing sessions we looked to define what
would be a representative end user of the platform. We characterize this user group as Dendro users
whose work deals with data management and processing. Users with an interest in making their
processing methods more easily accessible to contributors or research partners. Hence, the profile
of a representative subject for this experiment would be a user that holds experience working
with web notebooks and data management platforms, notably subjects with an interest in sharing
methodologies applied in their scientific work with others. We can also consider subjects that have
experience only in one of those categories, since we must account for convenience sampling, given
the still somewhat limited usage of web notebooks within scientific research communities in close
proximity.
The testing occured at a predefined room in which a computer running the implemented code
would be available for test subjects to carry out the evaluation tasks. The subjects were assisted
by a facilitator that collected the test data performing the role of data logger for the experience.
Finally, subjects were asked to fill two forms: one relative to the usability features of the system
and another relative to the overall satisfaction with the implemented work.
Each testing session was planned to last twenty to thirty minutes and the initial period for
testing would be a time frame of two weeks within the first half of February 2020.
52 Outcomes and Experiments
4.2.2 Experimental Sessions
The experimental procedure sessions took place in the Faculty of Engineering of the University
of Porto (FEUP), counting with the participation of members of Information Systems Laboratory
(InfoLab), a group based at the Department of Informatics Engineering at FEUP that brings to-
gether teachers and researchers in areas such as Information Management, Information Retrieval
and Information Systems [20].
Eight participants were interviewed and asked to perform a total of five tasks in the most effi-
cient and timely manner in order to later provide feedback regarding the usability and acceptability
of the user interface alongside some impressions of the implemented work impact.
Participants were chosen based on their similarity to our end user group, through contacts
with research partners, platforms users and collaborators. Users with no experience working with
the platform wouldn’t be desirable. In this manner and through convenience sampling the selected
subjects were a group of associates of InfoLab or subjects in proximity with them, that are actively
working within research groups and generally involved in work either with data processing, data
management and computational notebooks.
Before beginning, each subject was provided with a short overview of the platform and some
guidelines about the user interface or workflow necessary for the execution of the tasks but not
directly related to the test, alongside a document describing the procedure which can be analyzed
in Appendix C. Participants where then asked to examine a list of tasks and perform them in order
using only the provided computer and an instance of the Dendro website. The subject interaction
with the website was monitored by a facilitator, in this case the author of this dissertation, that
was also responsible for cataloging the time taken to perform each of the tasks and monitored the
session.
Based on the core features of the implementation and the formulated use cases, participants
were asked to perform tasks that would cover all of these aspects of the implemented work while
enveloping the common workflow of the platform. The tasks are as follows:
• Creation of a Notebook
• Upload/Creation of Notebook Files
• Executing the Notebook
• Sharing and Starting existing Notebook
• Previewing an External Notebook
These tasks contained simple instructions and should be completely explicit for the users al-
lowing them to perform the tasks without external interference. They are detailed below.
Creation of a Notebook: The goal of this tasks is the creation of a notebook. In this task you
will begin at the splash page for the website. An account has been previously created alongside
different projects within the platforms, you are already logged in. Your first task is to navigate
4.2 Experimental Outcomes 53
to the view containing the listing of available projects in the platform. You must then select the
default project, create a new folder and proceed to create a notebook.
Upload/Creation of Notebook Files: The goal of this task is to create a folder structure and
upload files through the notebook. In this task you will begin at the Jupyter Notebook splash page
of a previously created project containing a notebook folder. You will then proceed to use the
notebook file system to create two distinct folders and upload one or more designated files from
the computer to either of these folders. Make sure both files are uploaded on the same location:
-lorenz.py -lorenz.ipynb
Executing the Notebook: The goal of this task is to exemplify the typical workflow for data
processing and annotation in the implemented work. In this task you will begin at the notebook
folder. In order to save time initialize the “lorenz.py” file where any code changes can be made.
Launch the “lorenz.ipynb” notebook, this would be the typical processing and visualization note-
book. You are now free to run the code at will, and explore the parameterization. Finally you must
copy the address link for the notebook in question, save and close the notebook.
Sharing and Starting existing Notebook: The goal of this task is to emulate the sharing pro-
cess of a notebook. In this task you will begin at the splash page, where you should logout of the
current user. You should login with the credentials: user:demouser2 password:demouserpassword2015
This user has been previously added as a project collaborator. You will then proceed to navigate
to the previously created notebook and check if the notebook is according to the one previously
explored. Finally you must paste the link address in the navigation bar and assess if the URL
redirection is functioning correctly.
Previewing an External Notebook: The goal of this task is to access the preview display of
a notebook. In this task you will begin at the splash page for the website. An account has been
previously created alongside different projects within the platforms, you are already logged in.
Your first task is to navigate to the view containing the listing of available projects in the platform.
You must then select the previously uploaded “lorenz.ipynb” from its local version and navigate
to the notebook view tab. Then you should select a notebook file from the computer and drop it in
the notebook view tab rendering its preview.
The first two tasks “Creation of a Notebook” and “Upload/Creation of Notebook Files” look to
establish a foothold on the usability of the implemented service. While ‘Executing the Notebook”
looks to let participants familiarize with the full workflow for the creation of the most basic and
typical notebook user case. “Sharing and Starting existing Notebook” evaluates the adequacy of
the sharing and launching process. And finally, the “Previewing a Notebook” task looks to assess
the usability of the notebook viewer feature implemented in this dissertation and the advantages
it can bring to users of the platform. In the end of these tasks users would then answer a system
usability test and have a small discussion about the value of the implemented features.
4.2.3 Result Analysis
Here we present a brief overview and analysis of each of the tasks and sections of the interview
followed by a discussion over the extracted deductions.
54 Outcomes and Experiments
Figure 4.8: Characterization of subjects through level of experience with web notebooks, datamanagement and processing.
4.2.3.1 Test Subjects
In order to contextualize the ability of the test subjects within the context of this dissertation
the first step was to assess their level of experience with the technologies in question.
A total of eight participants were inquired. Four of these had ages comprised between 20-
25, while the remaining were between 25-35 years of age. They included two female and six
male participants. Each participant was asked their level of experience with web notebooks, data
management platforms and data processing tools from a scale ranging from unfamiliar to daily
user of the respective technology.
In Figure 4.8 we can see that from the target group of subjects we chose to interview many
are familiar with data management and processing from their work in InfoLab. However, as seen
in the figure, the usage of web notebooks is less common among our testers, as four users rank
as “Unfamiliar”, indicating that they have had no contact whatsoever with notebook technolo-
gies. This however didn’t present an obstacle to our experimental method since all of the subjects
were briefed on web notebooks and would interact with the implemented features in their tasks to
an extent that would allow to provide some overall considerations about the implemented work.
Furthermore, experience with data management platforms similar to Dendro is substantially more
valuable when providing considerations of the ability of this dissertation to influence reproducibil-
ity.
4.2.3.2 Time to Complete the Tasks
The first metric used to assess the usability of the implemented functions was the time to
completion of each task. In table 4.1 we can see the time each of the subjects took to perform
their tasks. All tasks on average, except for the “Executing the Notebook” task, were completed in
under two minutes by users that had no previous contact with the implemented work. Since tasks
were self contained and always included the full exercise of actions in order to obtain the desired
4.2 Experimental Outcomes 55
Figure 4.9: Graphical representation of time to task in “Creation of a Notebook”.
goal, the critical features of the notebook service are accessible in a reasonable time. This is a good
indicator that the implementation is effective in conveying the information necessary for a correct
usage of the newly added aspects to the platform. It should also be noted that the task “Executing
the Notebook” allowed users to freely explore the characteristics of the implementation so the time
to task is not a significant indicator since different users spent more time exploring while others
more quickly looked to resume the task.
Table 4.1: Time taken to complete each of the experimental tasks
Creation of a Notebook Upload/Creation of Notebook Files Executing the Notebook Sharing and Starting existing Notebook Previewing an External NotebookUser 1 00:54.07 02:53.09 01:44.31 01:46.49 01:51.02User 2 01:26.08 01:47.67 02:03.60 01:48.85 00:58.39User 3 01:48.63 00:44.32 01:09.76 01:24.43 02:16.13User 4 01:20.96 00:59.02 02:48.57 02:01.67 00:46.33User 5 01:16.60 01:06.15 02:47.16 01:59.70 01:37.16User 6 00:36.08 00:31.87 01:44.31 01:12.15 00:44.26User 7 00:47.64 00:45.10 02:34.16 01:51.75 00:39.30User 8 01:15.54 01:03.08 02:12.93 00:59.08 00:52.87Avg 1:10.70 1:13.79 2:08.10 1:38.02 1:13.18
To have more clear graphical overview over the time performance in these tasks, Figures 4.9,
4.10, 4.11, 4.12, and 4.13 represent graphically the results for the different users in each task
alongside a comparison with the respective average time value.
The time results on tasks related to previewing and creating notebooks are particularly sat-
isfying, indicating short times on the main notebook service functions of the implementation.
Alongside with the low amount of errors, as observed in the next section, they give us a good
metric for the usability of the platform.
4.2.3.3 Errors and Requests for Assistance
When the users were performing the scenario no critical errors were recorded. Under the
usability guide that drove this document, critical errors are significant deviations at the completion
level from the targets of the scenario. Since this didn’t occur through the experimental phase, we
can infer one of the usability goals of the document, a level of one hundred percent completion
rate in the designated tasks [48]. It should be noted that during the execution of the experiments
56 Outcomes and Experiments
Figure 4.10: Graphical representation of time to task in “Upload/Creation of Notebook Files”.
Figure 4.11: Graphical representation of time to task in “Executing the Notebook”.
Figure 4.12: Graphical representation of time to task in “Sharing and Starting existing Notebook”.
4.2 Experimental Outcomes 57
Figure 4.13: Graphical representation of time to task in “Previewing an External Notebook”.
one call for assistance was recorded, but it was due to a previously non-verbalized difficulty in
understanding the script of the experience and in no way represented an aspect to be considered
in the analysis of the implemented work. These tasks however were not free of incidents in which
the participants did not complete a scenario in the most optimal form. These are considered
non-critical errors, which, as previously referred, were divided into three categories navigation,
presentation and control usage errors.
There was only one possible scenario susceptible to control usage errors, namely the creation
of the notebook, since it required users to input a value for the naming of the notebook object. This
however didn’t came to fruition so there were no recorded control usage errors.
There were however several situations where both navigation errors and presentation errorsoccurred. In Figure 4.14 we can see the navigation errors by tasks while in Figure 4.15 we can see
the presentation errors.
Tasks one and four, “Creation of a Notebook” and “Sharing and Starting a Notebook”, respec-
tively, were where most of the navigation errors occurred, as seen in Figure 4.14. Through this
analysis we understood that some aspects within these actions can be improved. In the “Creation
of a Notebook” task, most navigation errors were caused by an inability of the users to find the
option to “Create a Notebook” within the project context, having returned to the project listing
screen or even trying to use the search function to create the notebooks. This led to some sugges-
tions to make the necessary steps to successfully create and use notebooks on the platform more
visible. In task number 4 — “Sharing and Starting a Notebook” — the reasoning is similar since
the control scheme is the same.
In many ways, the presentations errors seen in Figure 4.15 are similarly caused by the same
issues as the navigation errors were, namely by some difficulty in finding the correct objects or
misinterpreting some of the information conveyed by the user interface. It should also be noted
that on task 5 — “Previewing an External Notebook” — the errors were mostly caused by the
change in workflow, from the usage of the drop down menu previously used to start the notebook
to a drag and drop mechanism that caused some confusion.
It is interesting to analyse these results and try to extract meaningful insights and conclusions
58 Outcomes and Experiments
Figure 4.14: Cumulative values for navigation errors.
Figure 4.15: Cumulative values for presentation errors.
4.2 Experimental Outcomes 59
in order to improve the usability of this work. Ideally these features should be updated and the
experimental process repeated in order to improve usability.
4.2.4 System Usability Scale
To conclude this process, participants were prompted to answer a standard usability question-
naire, the System Usability Scale (SUS) [5]. This survey, attached in Appendix B, prompted the
subjects to choose an option on a scale from “Strongly Disagree” to “Strongly Agree”, to ten
statements regarding the implemented features.
1. I think that I would like to use this system frequently
2. I found the system unnecessarily complex
3. I thought the system was easy to use
4. I think that I would need the support of a technical person to be able to use this system
5. I found the various functions in this system were well integrated
6. I thought there was too much inconsistency in this system
7. I would imagine that most people would learn to use this system very quickly
8. I found the system very cumbersome to use
9. I felt very confident using the system
10. I needed to learn a lot of things before I could get going with this system
While these items individually are not meaningful, the SUS is capable of yielding a single
number to represent a measure of overall usability of the system [44]. These scores can range
from zero to one hundred and will be calculated for each of the SUS questionnaires answered by
the participants. In order to calculate this score we must sum contributions from each item, which
is variable. For items 1, 3, 5, 7, and 9, the score contribution is the value of the scale position minus
one. For items 2, 4, 6, 8 and 10, the contribution is five minus the value of the scale position, as
shown in the equation 4.1. In the end of this process, to obtain the overall value on the scale of a
hundred we must multiply all the scores by 2.5.
( ∑n=(Item1,3,5,7,9)
n−1)+( ∑n=(Item2,4,6,8,10)
5−n) (4.1)
The obtained results from the survey can be seen in table 4.2. The obtained SUS score is hard
to interpret if no scale of measurement is provided, but there are however several established ways
to measure the scores of the test. In Figure 4.16 different scoring methods are shown [43]. From
the presented methods the “percentile” and “acceptability” method are appropriate to extract some
conclusions over the obtained results. The percentile method indicates that the average score is
60 Outcomes and Experiments
Figure 4.16: Different scoring analysis methods associated with raw SUS scores.
68. This means that obtained that scores can be considered above and below average respectively.
The acceptability metric defining what is “acceptable” or “not acceptable, proposes acceptable to
correspond to roughly above 70 and unacceptable to below 50 [3]. These two methods support
each other. In Figure 4.17 we can see the scores, in red, of the SUS ranked in a graph conveying
the percentile and grade method. We can see that 5 out of 8 results score above the average and
therefore also on the “acceptable” category according to accessibility ranking. From the other 3
results that score below the average, two are within the 60 percentile and one of them is ranked
significantly below, with a score of 37.5. This score indicates that there are still some issues with
the user interface and workflow as previously discussed, since this score was directly from one of
the users that struggled the most with completing the tasks at hand, but it should also be stated that
this user had no previous experience with the platform or with notebooks.
Table 4.2: Results of the System Usability Survey
User 1 User 2 User 3 User 4 User 5 User 6 User 7 User 8Item 1 4 3 3 3 5 2 3 4Item 2 3 1 1 1 2 3 1 2Item 3 4 3 4 4 5 3 3 4Item 4 1 1 1 1 1 4 2 3Item 5 4 5 3 5 5 3 3 3Item 6 3 1 1 1 1 3 3 2Item 7 5 4 3 5 5 2 2 3Item 8 3 2 1 1 3 3 2 2Item 9 4 5 4 4 5 2 3 4Item 10 1 2 2 2 1 4 1 2Sum 30 33 31 35 37 15 25 27SUS Score 75 82.5 77.5 87.5 92.5 37.5 62.5 67.5
4.2.5 User Feedback
The final part of this procedure was a discussion with the subjects about their general opinion
on the impact of this work and how it delivered based on its initial goal.
4.2 Experimental Outcomes 61
Figure 4.17: SUS scores represented in red against percentile and grade graph
They were asked to fill another short form and finished by having a small personal discus-
sion where suggestions and opinions were registered. This form contained a series of statements
reflecting some of the goals of this implementation, and can be consulted in Appendix A.
The statements were the following:
• I believe this implementation brings improvement to the Dendro platform.
• This work improves the processing and data management aspects of data life cycle in the
platform.
• This work motivates the use of visualization in the platform.
• This work would brings improvements to the way data can be shared in the platform.
• This work can bring improvements to my workflow when dealing with data.
• I feel more compelled to create visualizations and annotate my work thanks to this imple-
mentation.
• This work makes the process of sharing and interpreting datasets more convenient.
• This work can have a significant impact in reproducibility of data.
From the analysis of Figure 4.18 we can infer that generally all the answers were positive and
that the implemented work can have a significant impact in the way the platform is used.
Some of the participants took the opportunity provided at the end of the questionnaire to pro-
vide some suggestion over the implemented features and some general views over this work.
62 Outcomes and Experiments
Figure 4.18: Results from the Implementation Appreciation Form
Some of these suggestions linked directly with the preview and visualization features of the
notebook service:
“I think the visualization (preview) aspect of the implementation should be more easily acces-
sible, previews for all the existing notebooks in a project would be helpful in order to reduce time
spent searching for notebooks.”
and
“I would like to see the option to explore notebooks existing in Dendro outside of the projects
themselves.”
These suggestions relating to the rendering and preview of notebooks are interpreted as a
signal of interest and relative success of the achieved implementation, since most subjects found
the feature interesting and wanted to see it expanded in other sections of the platform.
Some participants brought forward interesting suggestions relative to the notebook workflow.
“I would like to upload my notebooks and automatically generate notebook directories.”
By large the subjects that took part in this questionnaire accompanied the implementation of
the notebook service and helped shape this work. Its a good indicator that their level of satisfaction
with the features hits some of the goals envisioned for the platform by this work.
There is a general consensus that sharing visualizations and processing methods within the
platform has been improved by this work.
To extract a concrete value from Figure 4.18, we can look at table 4.3, where, to obtain a single
result for each of the statements, we calculated the average of the stacked results and rounded to
the closest factor from “Strongly Disagree” to “Strongly Agree”. The obtained results consist in
4.2 Experimental Outcomes 63
six statements getting the “Somewhat Agree” classification and two obtaining the highest classi-
fication in the scale, “Strongly Agree”. These are noticeably good results especially considering
that the “This work can have a significant impact in reproducibility of data.” statements, the main
goal of this implementation, achieved the “Strongly Agree” classification by the participants in the
survey.
Table 4.3: Results for satisfaction form in table form
StronglyDisagree
SomewhatDisagree
Neither Agreeor Disagree
SomewhatAgree
StronglyAgree
AverageScore
FinalRating
I believe this implementationbrings improvementto the Dendro platform
0 0 3 3 2 3,875Somewhat
Agree
This work improves theprocessing and data managementaspects of data life cycle in the platform
0 0 1 5 2 4,125Somewhat
Agree
This work motivates the useof visualization in the platform
0 0 2 2 4 4,25Somewhat
AgreeThis work would bringsimprovements to the way datacan be shared in the platform
0 0 2 4 2 4Somewhat
Agree
This work can bring improvementsto my workflow when dealing with data
0 0 1 4 3 4,25Somewhat
AgreeI feel more compelled to createvisualizations and annotate mywork thanks to this implementation
0 0 2 5 1 3,875Somewhat
Agree
This work makes the process ofsharing and interpreting datasetsmore convenient
0 0 0 4 4 4,5Strongly
Agree
This work can have a significantimpact in reproducibility of data
0 0 0 4 4 4,5Strongly
Agree
One of the core goals of the implemented work was to increase reproducibility of scientific
data, and the researchers that were interviewed in the conduction of this survey agree that some of
the essential tools in order to achieve this have been introduced within Dendro.
64 Outcomes and Experiments
Chapter 5
Conclusions and Future work
In this final chapter, an overview of all developed work is presented. We reflect on the success
of the implementation and the main accomplishments and conclusions achieved in this disserta-
tion. We also provide some suggestions for the further development of this work either being
tangible features that weren’t implemented or interesting concepts that can possibly be expanded
upon.
5.1 Conclusions
When establishing the goals for this dissertation, the fundamental aspect to be demonstrated
was that, through the inclusion of web notebooks as a service in data management platforms, we
would be able to achieve higher levels of reproducibility as a consequence of pairing data with
respective processing code and visualizations. Through the observed results of our experimental
method and consultation with the selected expert user we have a positive outlook. Even though the
sample is small, most of the users that came across this implementation recognised its potential
in improving the workflow and appeal to a more careful treatment of research data. However the
concept at discussion is relatively new and therefore only time will tell if it will be successfully
accepted and used as a standard. The implementation however was successful in providing tools
capable of incorporating processing and visualization with the respective data. In that light, the
implementation of this dissertation provides us with positive outlooks on these matters and sets
a good basis to expand upon. The state of the art analysis in particular was fundamental towards
understanding the importance of solutions encouraging reproducibility. There is a widespread
understanding that Open Science benefits will be at the core of a more efficient scientific pro-
cess [30]. Open Data in particular is where this implementation can have an effect since it seemed
to progress at a slower pace than other categories [45]. Web notebooks, with their accessibility and
versatility are suggested by many as a solution for this problem, by combining the features of these
platforms with the already established scientific data management platforms to bring innovative
ways to approach open data [54, 38].
65
66 Conclusions and Future work
Through the interviews with researchers we can understand that several fields of study use
Dendro as a platform. To provide visualization solutions that allow techniques from data monitor-
ing to presentation is essential in order to guarantee that the platform has an appeal for the multiple
disciplines of the user groups that work with Dendro. By analysing and implementing solutions
that support visualization it became clear how much visualization is capable of impacting scien-
tific research. With the large availability of data and ease of storing very large amounts of data it
becomes fundamental to couple structures that hold this data with methods to perform annotation
in a more appealing way than plain text.
The experimental process and the user feedback received indicates a positive outlook on the
achieved results of this implementation. Around fifty percent of the inquired participants agreed
strongly that this implementation facilitates the sharing and interpretation of data. The same num-
ber of participants also agree that this work can have a significant impact on the reproducibility
of data. The conversations carried out with these participants indicated that the adaptability of the
notebook and the interactions with the data itself and visualization are the aspects that approach
this implementation’s goals of improving reproducibility and encouraging Open Science through
the ability to use web notebooks.
This dissertation has built the foundation for a highly versatile platform capable of combining
traditional data management with visualization and processing, effectively making data feel lively
within Dendro. The full extent to which these capabilities can be explored is still not certain given
the modular nature and possibility for expansion of Jupyter Notebook, but the base work has been
set and can now be expanded upon.
5.2 Future Work
Different aspects can be improved upon in order to contribute to a more versatile web note-
book service. One of the ways this work can be improved upon is by the integration of different
visualization libraries with the current established notebook. The current integration of Jupyter
Notebook with Dendro allows for a considerable number of visualization solutions to be used but
does not guarantees compatibility with some popular libraries that are not native to Jupyter. Gath-
ering a list of some of the popular libraries (based on this state of the art work for instance) and
integrating them with the current Jupyter image would strengthen this service.
The synchronization process between the notebook containers and the database can still be
improved, specifically when dealing with large projects with a big number of files or with the
periodic nature of the synchronization process.
The internal structure of the notebook within Dendro is still limited being very similar in nature
to project folders. This hinders the process of cloning or for instance forking notebooks by other
projects, making them static within the project they are contained in. It would be interesting to
see this feature expanded allowing for different research groups to more easily access each others
work.
5.2 Future Work 67
Dendro is composed of several aspects, for instance the social features, and not all of these
aspects are linked to this new service. It would be interesting to allow cross platform notification
for successful notebooks or even using these social features as a way to share these notebooks,
increasing the reproducibility potential of this implementation.
These are just some aspects that can further improved in order to additionally push towards
the vision of improving reproducibility through the usage of web notebooks.
68 Conclusions and Future work
References
[1] Ricardo Carvalho Amorim, João Aguiar Castro, João Rocha da Silva, and Cristina Ribeiro.A comparison of research data management platforms: architecture, flexible metadata andinteroperability. Universal Access in the Information Society, 16(4):851–862, 2017.
[2] Monya Baker. “Is there a reproducibility crisis? A Nature survey lifts the lid on howresearchers view the crisis rocking science and what they think will help”. Nature, vol. 533,no. 7604, p. 452+, 2016.
[3] Aaron Bangor. An Empirical Evaluation of the System Usability Scale: InternationalJournal of Human–Computer Interaction: Vol 24, No 6.
[4] Shannon Bohle. What is E-science and How Should it be Managed? Spektrum derWissenshaft, 2013.
[5] John Brooke. SUS-A quick and dirty usability scale. Technical report.
[6] S. K. Card, J. D. Mackinlay, and B. Schneiderman. Readings in Information Visualization:Using Vision to Think. San Francisco: Morgan Kaufmann. 1999.
[7] Stuart K Card and Jock Mackinlay. "The structure of the information visualization designspace," Proceedings of VIZ ’97: Visualization Conference, Information VisualizationSymposium and Parallel Rendering Symposium, Phoenix, AZ, USA,pp. 92-99, 1997.
[8] CKAN. Using CKAN: storing data for re-use.https://ckan.org/files/2012/08/OKF-OR12-poster.pdf.
[9] European Comission. Trends for open access to publications.https://ec.europa.eu/info/research-and-innovation/strategy/goals-research-and-innovation-policy/open-science/open-science-monitor/trends-open-access-publications_en.
[10] European Comission. Open Science Monitor Study on Open Science: Monitoring Trendsand Drivers. Technical report, 2019.
[11] D3. D3.js - Data-Driven Documents, https://d3js.org.
[12] João Rocha da Silva, Cristina Ribeiro, and João Correia Lopes. Ranking Dublin Coredescriptor lists from user interactions: a case study with Dublin Core Terms using theDendro platform. International Journal on Digital Libraries, pages 1–20, 2018.
[13] Paul A David, Ralph Schroeder, William H Dutton, Paul Jeffreys, and Matthijs Den Besten.Will e-Science Be Open Science? Technical report, 2009.
[14] Distill. Distill.Pub, https://distill.pub.
69
70 REFERENCES
[15] Docker Inc. Docker overview, https://docs.docker.com/engine/docker-overview.
[16] Docker Inc. Empowering App Development for Developers, https://www.docker.com.
[17] Thea Marie Drachen, Ole Ellegaard, Asger Væring Larsen, and Søren Bertil FabriciusDorch. Sharing Data Increases Citations. LIBER QUARTERLY, 26(2):67–82, aug 2016.
[18] Distill Editors. Distill Update 2018. https://distill.pub/2018/editorial-update/.
[19] GoFAIR.org. FAIR Principles - GO FAIR, https://www.go-fair.org/fair-principles.
[20] InfoLab Information Systems Research Group. InfoLab | Information Systems ResearchGroup, https://infolab.fe.up.pt.
[21] Hadley Wickham. Create Elegant Data Visualisations Using the Grammar of Graphics |ggplot2, https://ggplot2.tidyverse.org.
[22] Jeffrey Heer and Ben Shneiderman. Visualization 1 Interactive Dynamics for VisualAnalysis A taxonomy of tools that support the fluent and flexible use of visualizations.Technical report.
[23] I. Herman, G. Melancon, and M.S. Marshall. Graph visualization and navigation ininformation visualization: A survey. IEEE Transactions on Visualization and ComputerGraphics, 6(1):24–43, 2000.
[24] Observable Inc. Observable. https://beta.observablehq.com/.
[25] Michael Droettboom John Hunter, Darren Dale, Eric Firing. Matplotlib: Python plotting,https://matplotlib.org.
[26] Jupyter Project. Jupyter Notebook. https://jupyter.org/, 2019.
[27] Margaret Rouse. What is a Docker Image and How is it Used?,https://searchitoperations.techtarget.com/definition/Docker-image, 2019.
[28] Bruno Monteiro Marques, João Rocha da Silva, and Tiago Devezas. Visualization inReproducible Science: A comparative overview of interactive Web Journals andcomputational notebooks. pages 7–10. 2019 14th Iberian Conference on InformationSystems and Technologies (CISTI), 2019.
[29] Wolfram Mathematica. Wolfram Mathematica: Computação técnica moderna,https://www.wolfram.com/mathematica.
[30] Marcia McNutt. Improving Scientific Communication. Science, 342(6154):13–13, oct2013.
[31] Ingeborg Meijer and Tung Tung Chan. The Netherlands’ Plan on Open Science. Technicalreport, European Commission, 2018.
[32] Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres. Challenges of mapping currentCKAN metadata to DCAT. Technical report.
[33] Jakob Nielsen. Why You Only Need to Test with 5 Users. Nielsen Norman Group, mar2000.
REFERENCES 71
[34] Chris Olah and Shan Carter. Attention and Augmented Recurrent Neural Networks. Distill,Volume 1, sep 2016.
[35] Astrid Orth, Nancy Pontika, and David Ball. FOSTER’s Open Science Training Tools andBest Practices. In: Positioning and Power in Academic Publishing: Players, Agents andAgendas: Proceedings of the 20th International Conference on Electronic Publishing(Loizides, Fernando and Schmidt, Birgit eds.). pages 135–141, 2016.
[36] Claus Pahl, Antonio Brogi, Jacopo Soldani, and Pooyan Jamshidi. Cloud ContainerTechnologies: a State-of-the-Art Review. IEEE Transactions on Cloud Computing,7161(c):1–14, 2017.
[37] Jeffrey M. Perkel. Data visualization tools drive interactivity and reproducibility in onlinepublishing. Nature 2018 554:7690, jan 2018.
[38] Bernadette M. Randles, Irene V. Pasquetto, Milena S. Golshan, and Christine L. Borgman.Using the Jupyter Notebook as a Tool for Open Science: An Empirical Study. InProceedings of the ACM/IEEE Joint Conference on Digital Libraries, pages 1–2. IEEE, jun2017.
[39] João Rocha Da Silva, Cristina Ribeiro, and João Correia Lopes. Ontology-basedmulti-domain metadata for research data management using triple stores, IDEAS ’14:Proceedings of the 18th International Database Engineering Applications Symposium,Pages 105–114, july 2014.
[40] Margaret Rouse. What is the Data Lifecycle,https://whatis.techtarget.com/definition/data-life-cycle, 2017.
[41] Adam Rule, Amanda Birmingham, Cristal Zuniga, Ilkay Altintas, Shih-Cheng Huang, RobKnight, Niema Moshiri, Mai H. Nguyen, Sara Brin Rosenthal, Fernando Pérez, andPeter W. Rose. Ten Simple Rules for Reproducible Research in Jupyter Notebooks.Technical report, 2018.
[42] Arvind Satyanarayan, Dominik Moritz, Kanit Wongsuphasawat, and Jeffrey Heer.Vega-Lite: A Grammar of Interactive Graphics. 2017.
[43] Jeff Sauro. MeasuringU: 5 Ways to Interpret a SUS Score.https://measuringu.com/interpret-sus-score.
[44] U.S. Department of Health & Human Services. About Us | Usability.gov.https://www.usability.gov/about-us.
[45] Elta Smith and Salil Gunashekar. Open Science Monitoring The team at RAND Europeincluded: Impact Case Study Zenodo. Technical report.
[46] Traefik. Traefik, The Cloud Native Edge Router | Containous, https://containo.us/traefik.
[47] University of Cambridge. How much do publishers charge for Open Access? | OpenAccess, https://www.openaccess.cam.ac.uk/paying-open-access/how-much-do-publishers-charge-open-access.
[48] U.S. Department of Health & Human Services. Methodshttps://www.usability.gov/how-to-and-tools/methods/index.html.
72 REFERENCES
[49] Usability.Gov. Improving the User Experience - Use Cases.https://www.usability.gov/how-to-and-tools/methods/use-cases.html.
[50] J.J. van Wijk. The Value of Visualization. In IEEE Visualization 2005 - (VIS’05), pages11–11. IEEE.
[51] Vega Project. Examples gallery Vega. https://vega.github.io/vega/examples/.
[52] Junpeng Wang, Subhashis Hazarika, Cheng Li, and Han Wei Shen. Visualization and VisualAnalysis of Ensemble Data: A Survey. IEEE Transactions on Visualization and ComputerGraphics, PP(8):1, 2018.
[53] Mark D Wilkinson. The FAIR Guiding Principles for scientific data management andstewardship. 2016.
[54] Jack Zentner and Tom McDermott. Web notebooks as a knowledge management tool forsystem engineering trade studies. In 11th Annual IEEE International Systems Conference,SysCon 2017 - Proceedings, 2017.
Appendix A
Implementation OverviewQuestionnaire
73
1.
2.
Marcar apenas uma oval.
Not familiar with the concept
1 2 3 4 5
Daily Usage
3.
Marcar apenas uma oval.
Not familiar with the concept
1 2 3 4 5
Daily Usage
4.
Marcar apenas uma oval.
Not familiar with the concept
1 2 3 4 5
Daily Usage
General Appreciation
Subject Pro�lesThis form is to be filled by the facilitator/data logger
Subject ID
How would you describe your level of experience with Web Notebooks?
How would you describe your level of experience with Data ManagementPlatforms?
How familiar are you with data processing tools?
5.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
6.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
7.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
8.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
I believe this implementation brings improvement to the Dendro Platform.
This work improves the processing and data management aspects of data life cyclein the Platform.
This work motivates the use of visualization in the Platform.
This work would brings improvements to the way data can be shared in thePlatform.
9.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
10.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
11.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
12.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
This work can bring improvements to my workflow when dealing with data.
I feel more compelled to create visualizations and annotatte my work thanks to thisimplementation.
This work makes the process of sharing and interpreting datasets moreconvenient.
This work can have a significant impact in reproducibility of data.
13.
Este conteúdo não foi criado nem aprovado pela Google.
Improvements/Suggestions for implementations.
Formulários
78 Implementation Overview Questionnaire
Appendix B
System Usability Scale Form
79
1.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
2.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
3.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
System Usability ScaleThis is a form using defined metrics in order to obtain a System Usability Scale
I think that I would like to use this system frequently
I found the system unnecessarily complex
I thought the system was easy to use
4.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
5.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
6.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
7.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
I think that I would need the support of a technical person to be able to use thissystem
I found the various functions in this system were well integrated
I thought there was too much inconsistency in this system
I would imagine that most people would learn to use this system very quickly
8.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
9.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
10.
Marcar apenas uma oval.
Strongly Disagree
1 2 3 4 5
Strongly Agree
Este conteúdo não foi criado nem aprovado pela Google.
I found the system very cumbersome to use
I felt very confident using the system
I needed to learn a lot of things before I could get going with this system
Formulários
Appendix C
Introductory Guide to System TestingProcedure
83
Welcome and Purpose Thank you so much for coming in today. Before we begin I wanted to give you some information about what you will be performing in this test session and proceed to give you some time to ask any questions you find appropriate. In this session we are asking you to serve as an evaluator of a set of website functionalities by completing a number of tasks. The goal is to determine how easy or difficult you find the usage of the website and these features. Afterwards recording a general opinion over the value of the system. I am here to record your reactions and comments about the website during the tasks you will be performing. Test Facilitator’s Role During this session, you are free to think aloud as you work to complete the tasks. I will not be able to offer any suggestions or hints, but from time to time, I may ask you to clarify what you have said or ask you for information on what you were looking for or what you expect to have happen. Test Participant’s Role Today I am going to be asking you to perform some actions on the website and tell me how easy or difficult it was to perform this tasks. These activities have as purpose to evaluate the ease of use of the implemented features. There are no right or wrong answers. If you have any questions, comments or areas of confusion while you are working, please let me know. If you ever feel that you are lost or cannot complete a task with the information that you have been given, please let me know. I will ask you what you might do in a real-world setting and then either put you on the right track or move you on to the next scenario. As you use the site, please do so as you would at home or your office. I would ask that you to try work through the tasks based on what you see on screen, but if you reach a point where you are not sure where or how to find something please inform me. Your name will not be associated or reported with data or findings from this evaluation. I may ask you other questions as we go and we will have wrap up questions at the end. Do you have any questions before we begin?