\section{Project I: Performance Analysis}
\label{sec:projectI}
In this section we describe the specifications of the timing
experiments and how the results are obtained.
Our notation is the following:
\begin{itemize}
\item $N$: total number of rows;
\item $P$: number of predictor variables, so the total number of
  columns is $P+1$;
\item $R$: number of subsets;
\item $M$: number of rows in each subset, so $M=N/R$;
\item $C$: number of cores on the cluster; $C=12$ on the
  deneb/mimosa cluster.
\end{itemize}
For a given set of values of $N$, $P$, and $R$, we will generate $R$
subsets, each a matrix of $M$ rows and $P+1$ columns. We will perform
two computing tasks on the generated data set. The first task runs
logistic regression on each subset of the data and then sums the
estimates of the regression coefficients across subsets; we call this
task glmfit. The second task mimics the I/O of the first by reading
the same amount of data and writing out the same amount of output; we
call it nullfit. We will record the timings for both tasks and analyze
the results. The code for generating the data sets and running the two
tasks is provided in {\tt lec3.r} under the cvs directory
{\tt rrhipehadoop/notes/}.
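The two tasks can be sketched in R as follows, with small toy sizes
and a made-up data generator standing in for the actual code in
{\tt lec3.r}:

```r
# Toy sizes; the experiments use much larger N and R.
N <- 2^12; P <- 7; R <- 4
M <- N / R  # rows per subset

# Generate R subsets, each with M rows and P+1 columns:
# P predictors plus a binary response y (made-up data).
subsets <- lapply(seq_len(R), function(r) {
  X <- matrix(rnorm(M * P), M, P)
  y <- rbinom(M, 1, plogis(rowSums(X) / P))
  data.frame(X, y = y)
})

# glmfit: logistic regression on each subset, then sum the
# coefficient estimates across subsets.
coef.sums <- Reduce(`+`, lapply(subsets, function(s)
  coef(glm(y ~ ., data = s, family = binomial()))))

# nullfit: touch the same input but skip the fitting, producing
# output of the same size as glmfit's per-subset result.
null.out <- lapply(subsets, function(s) numeric(P + 1))
```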
\subsection{Experiments I and II}
We fix $N$ and $P$ and let $R$ vary in each of the first two experiments.
Specifications for experiment 1:
\begin{itemize}
\item $N=2^{20}$ is fixed,
\item $P=7$ is fixed,
\item $R=2^3, 2^4, \ldots, 2^{13}$, and correspondingly
\item $M=2^{17},2^{16},\ldots,2^7$;
\item The total size of a data set consisting of $R$ subsets is 64MB.
\end{itemize}
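As a check on the stated size, assuming each value is stored as an
8-byte double, the full data set occupies
\[
N \times (P+1) \times 8 \;=\; 2^{20} \times 2^{3} \times 2^{3}
\ \mbox{bytes} \;=\; 2^{26}\ \mbox{bytes} \;=\; 64\mbox{MB}.
\]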
Specifications for experiment 2:
\begin{itemize}
\item $N=2^{25}$ is fixed,
\item $P=15$ is fixed,
\item $R=2^7,2^8,\ldots,2^{17}$, and
correspondingly
\item $M=2^{18},2^{17},\ldots,2^8$;
\item The total size of a data set consisting of $R$ subsets is 4GB.
\end{itemize}
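Under the same 8-byte-per-value assumption, the stated size checks out:
\[
N \times (P+1) \times 8 \;=\; 2^{25} \times 2^{4} \times 2^{3}
\ \mbox{bytes} \;=\; 2^{32}\ \mbox{bytes} \;=\; 4\mbox{GB}.
\]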
We will run 3 repetitions of both glmfit and nullfit for each of the
11 values of $R$ in both experiments, so each experiment will produce
$11\times 2\times 3=66$ timing records in total.
In previous experiments we have observed decreasing running times when
map-reduce jobs are run consecutively on the same input data set,
probably because the input data set gets cached in memory during the
first job. We therefore define a protocol for the experiments that
avoids this problem.
Protocol of experiment 1 and experiment 2:
\begin{enumerate}
\item Run the experiment in order of $R$, from the smallest value
  to the largest.
\item Re-generate the data set before running each computing task,
  i.e., use different data sets for glmfit and nullfit for a given
  value of $R$.
\end{enumerate}
Display of the results from the experiments:
\begin{enumerate}
\item We will save the timing records in text files; the format of
  the text files is provided in {\tt lec3.r}. The text files should be
  named
{\tt your\_career\_account.timing.results.experiment\_number.txt},
e.g., Jeff's results from experiment 1 should be
{\tt li271.timing.results.1.txt}.
\item Each of us will have a separate {\tt .tex} file for documenting
  the experiment procedure and displaying the results. The {\tt .tex}
  files should be named {\tt your\_career\_account.timing.tex}, e.g.,
  Jeff's file should be {\tt li271.timing.tex}. The {\tt .tex} file
  should be added as a subsection under Section ``Project I:
  Performance Analysis'' (Section~\ref{sec:projectI}) of
  {\tt main.tex}. Note that each person will have only one {\tt .tex}
  file and only one subsection, and you will display the results of
  both experiments in the same {\tt .tex} file.
\item Each of us will plot the timing results and add the plots to
  his/her {\tt .tex} file. The plots should be named {\tt
  your\_career\_account.timing.plot.experiment\_number.ps}, e.g.,
  Jeff's plot for experiment 1 should be {\tt li271.timing.plot.1.ps}.
  All three repetitions should be plotted, and the minima of the three
  should be connected by lines. The code for plotting is provided in
  {\tt lec4.r}; you will need to edit the plot title to identify whose
  plot it is and which experiment it shows.
\item All files, including the text files with raw results, the
  {\tt .tex} files for documentation, and the {\tt .ps} files for
  plots, should be added to the cvs directory. Instructions for using
  cvs are in Dr.\ Cleveland's email and in {\tt lec5.r}.
\end{enumerate}
\subsection{Experiment III}
Specifications of experiment 3:
\begin{itemize}
\item $M=2^{11}$ is fixed,
\item $P=15$ is fixed,
\item $R=2^{15},2^{16},\ldots,2^{19}$, 5 different values, and correspondingly
\item $N=2^{26},2^{27},\ldots,2^{30}$,
\item For each combination of $P$, $N$, and $M$, perform the two
  compute tasks glmfit and nullfit (coded as M + R/W and R/W,
  respectively). Make sure to generate a new data set before each
  compute task of each combination of $P$, $N$, and $M$,
\item Repeat each run 3 times,
\item Together, there will be $5\times 2\times 3=30$ timing records.
\end{itemize}
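Note that, unlike experiments 1 and 2, the total size of the data set
now grows with $R$: under the same 8-byte-per-value assumption, it
ranges from $2^{26}\times 2^{4}\times 2^{3}=2^{33}$ bytes (8GB) at
$R=2^{15}$ to $2^{37}$ bytes (128GB) at $R=2^{19}$.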
In experiment 3, we will follow the same protocol as experiments 1 and 2.
We will also display the results in the same way.
\subsection{Experiment IV}
Specifications of experiment 4:
\begin{itemize}
\item $P=15$ is fixed,
\item $N=2^{19},2^{20},\ldots,2^{28}$, 10 different values,
\item $M=2^{10},2^{11},2^{12}$, 3 different values,
\item For each combination of $P$, $N$, and $M$, perform the two
  compute tasks glmfit and nullfit (coded as M + R/W and R/W,
  respectively). Make sure to generate a new data set before each
  compute task of each combination of $P$, $N$, and $M$,
\item We will not run any repetitions; instead, the runs done by each
  person will serve as the repetitions,
\item Together, there will be $10\times 3 \times 2=60$ timing records.
\end{itemize}
Requirements for displaying results for experiment 4:
\begin{enumerate}
\item The timing records will be saved in text files, following the same naming convention defined above. The text files should contain 60 rows and 4 columns: $\log_2{N}$, $\log_2{M}$, compute task (glmfit or nullfit), and timing.
\item To display the results visually, plot log base 2 timing against
  log base 2 subset size ($\log_2{M}$), conditioning on the total
  number of rows ($N$) and the compute task (M + R/W and R/W). This
  gives 20 panels, and in each panel 3 points will be drawn and
  connected by lines. Use {\tt xyplot()} in the {\tt lattice} package
  and arrange the panels in 10 columns and 2 rows, such that panels in
  the same column have the same value of $N$ and panels in the same
  row have the same compute task; this can be done by setting
  {\tt layout=c(10,2)}. Draw one plot with the same scale on the
  vertical axis (the default setting of {\tt xyplot()}) and another
  plot setting {\tt scales=list(y="sliced")}. Name the plots
  {\tt career\_account.timing.plot.4.ps} and
  {\tt career\_account.timing.plot.4.sliced.ps} accordingly.
\end{enumerate}
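The conditioned display described above might be set up along the
following lines. The timing values here are placeholders, and the
synthetic table stands in for reading the actual results file with
{\tt read.table()}:

```r
library(lattice)

# Synthetic stand-in for the 60 x 4 timing table; in practice this
# would come from read.table() on the timing results file.
d <- expand.grid(log2N = 19:28, log2M = 10:12,
                 task = c("M + R/W", "R/W"))
d$timing <- 2^(d$log2N - d$log2M) * 0.01 + 5  # placeholder timings

# 20 panels in 10 columns and 2 rows: columns share log2(N),
# rows share the compute task.
p <- xyplot(log2(timing) ~ log2M | factor(log2N) * task,
            data = d, type = c("p", "l"), layout = c(10, 2),
            xlab = "log base 2 rows per subset",
            ylab = "log base 2 elapsed time")

# Second version with the y scale sliced across panels.
p.sliced <- update(p, scales = list(y = "sliced"))
```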
\subsection{Extended Experiment IV}
This is an extension of experiment 4, and the specifications are:
\begin{itemize}
\item $P=15$ is fixed,
\item $N=2^{19},2^{20},\ldots,2^{28}$, 10 different values,
\item $M=2^{9},2^{10},2^{11},2^{12},2^{13}$, 5 different values,
\item for each combination of $P$, $N$, and $M$, perform the two
  compute tasks glmfit and nullfit (coded as M + R/W and R/W,
  respectively); make sure to generate a new data set before each
  compute task of each combination of $P$, $N$, and $M$,
\item together, there will be $10\times 5 \times 2=100$ timing records.
\end{itemize}
The requirements for displaying results are the same as for experiment 4.
\subsection{Experiment V}
Specifications of experiment 5:
\begin{itemize}
\item $P=15$ is fixed.
\item $\log_2N=23,24,\ldots,30$; there are 8 different values. Each of
  us will be assigned one value of $N$ and will run the experiment
  with that $N$. The assignments are: 23-Mel, 24-Cheng, 25-Jorge,
  26-Han, 27-Qiming, 28-Xiaosu, 29-Kyle, 30-Jeff.
\item $\log_2M=8,9,\ldots,18$; there are 11 different values.
\item Perform the two compute tasks glmfit and nullfit (coded as
  M + R/W and R/W, respectively).
\item For each combination of $P$, $N$, $M$, and compute task, realize
  three replicates.
\item There will be $11\times 2 \times 3=66$ timing records for each person, and collectively there will be $66 \times 8=528$ timing records.
\end{itemize}
Protocol of experiment 5:
\begin{enumerate}
\item For the assigned value of $N$, generate 11 data sets, each with
  a different value of $M$, in order of $M$ from smallest to largest.
\item Run glmfit on the data sets in order of $M$ from smallest to largest.
\item Repeat steps 1 and 2 two more times.
\item For the same value of $N$, generate 11 data sets, each with a
  different value of $M$, in order of $M$ from smallest to largest.
\item Run nullfit on the data sets in order of $M$ from smallest to largest.
\item Repeat steps 4 and 5 two more times.
\end{enumerate}
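The six protocol steps amount to the nested loops below.
{\tt generate.data()} and {\tt run.task()} are hypothetical stand-ins
for the code in {\tt lec8.r}; here they only log their calls so that
the prescribed ordering is visible:

```r
# Record each call so the ordering can be inspected.
calls <- character(0)
generate.data <- function(log2m)
  calls <<- c(calls, sprintf("generate %d", log2m))
run.task <- function(task, log2m)
  calls <<- c(calls, sprintf("%s %d", task, log2m))

log2M <- 8:18
for (task in c("glmfit", "nullfit")) {  # glmfit first, then nullfit
  for (rep in 1:3) {                    # three replicates of each task
    for (m in log2M) generate.data(m)   # steps 1 and 4: regenerate data
    for (m in log2M) run.task(task, m)  # steps 2 and 5: run in order of M
  }
}
```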
The code for running the experiment is provided in {\tt lec8.r} under cvs directory {\tt rrhipehadoop/notes/}. Note you will just need to change the value of {\tt name} and {\tt N} in the code.
The requirements for displaying results of experiment 5:
\begin{enumerate}
\item The timing records will be saved in text files, following the
  same naming convention defined above, e.g., Jeff's results will be
  saved in {\tt li271.timing.results.5.txt}. The text files should
  contain 22 rows (one glmfit row and one nullfit row for each of the
  11 values of $M$) and 6 columns: $\log_2{N}$, $\log_2{M}$, compute
  task (glmfit or nullfit), and the three replicate timings.
  A sample text file {\tt li271.timing.results.5.txt} is provided
  under the cvs directory {\tt rrhipehadoop/notes/}.
\item We are interested in the minimum of the 3 replicate timings, and
  we will generate two plots to display these minima. In the first,
  plot log base 2 time (minimum of 3) against log base 2 number of
  rows per subset ($\log_2{M}$) and name the file
  {\tt career\_account.timing.plot.5.ps}; in the second, plot log base
  2 time (minimum of 3) against log base 2 number of subsets
  ($\log_2{R}$) and name the file
  {\tt career\_account.timing.plot.5.R.ps}. In both plots, we will
  condition on the compute task, which has two levels, M + R/W and
  R/W. We introduce a third level for the compute task, coded as M,
  defined as the difference between the log base 2 time (minimum of 3)
  of M + R/W and the log base 2 time (minimum of 3) of R/W. Both plots
  will therefore have 3 panels with 11 points in each panel, and the
  panels should be ordered M + R/W, R/W, M. The code for producing the
  plots is provided in {\tt lec8.r}.
  Two sample plots are provided under the cvs directory
  {\tt rrhipehadoop/notes/}.
\end{enumerate}
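The derived level M in item 2 is just a per-row difference of log base
2 minima; a sketch with made-up timings (the real computation is in
{\tt lec8.r}):

```r
# Toy table in the experiment-5 layout: one glmfit row and one
# nullfit row per value of log2(M), three replicate timings each.
log2M <- 8:18
base  <- 2^(18 - log2M)  # made-up shape for the timings
tab <- rbind(
  data.frame(log2M, task = "glmfit",
             t1 = base + 30, t2 = base + 31, t3 = base + 32),
  data.frame(log2M, task = "nullfit",
             t1 = base + 10, t2 = base + 11, t3 = base + 12))

# Minimum over the three replicates, then
# M = log2(min of M + R/W) - log2(min of R/W), per value of log2(M).
tmin    <- pmin(tab$t1, tab$t2, tab$t3)
m.rw    <- log2(tmin[tab$task == "glmfit"])   # M + R/W panel
rw      <- log2(tmin[tab$task == "nullfit"])  # R/W panel
m.level <- m.rw - rw                          # M panel
```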