
Overview

The paper makes the following claims requiring artifact evaluation on page 2 (comments to AEC reviewers follow each colon):

  1. Execution engine: Fractal's light-weight instrumentation, progress and health monitors, and the executor.
  2. Performance optimizations: Fractal's event-driven executor, buffered-io sentinel stripping, and batched scheduling.
  3. Fault injection: Fractal's internal subsystem, frac, that enables large-scale characterization of fault recovery behaviors.

This artifact targets the following badges (mirroring the NSDI26 artifact "evaluation process"):

  • Artifact available: Reviewers are expected to confirm public availability of core components (~5mins)
  • Artifact functional: Reviewers are expected to verify distributed execution workflow and run a minimal "Hello, world" example (~10mins).
  • Results reproducible: Reviewers are expected to reproduce the key result: Fractal's correct and efficient fault recovery for both regular-node and merger-node failures, demonstrated by its performance compared to fault-free conditions (Fig. 7, ~65mins, optionally ~1 week).

Note that Fractal builds on top of DiSh, an MIT-licensed open-source software that is part of the PaSh project.

To "kick the tires" for this artifact: (1) Skim this README file to get an idea of the artifact's structure (2 minutes), and (2) Jump straight into the exercisability section of the README file (5 minutes).

We will provide the private key and password through HotCRP for reviewers to access the control nodes for both clusters. Save the provided private key to your local machine (e.g., as fractal.pem) and change its permissions using chmod 400 fractal.pem. When connecting via ssh, enter the provided password when prompted.
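The key-handling steps can be sketched as follows (`fractal.pem` is the example name from above; `touch` stands in for saving the key received via HotCRP, `<control-node>` is a placeholder, and `stat -c` assumes GNU coreutils):

```shell
# Stand-in for saving the HotCRP-provided private key to disk
KEY=fractal.pem
touch "$KEY"
# Restrict permissions so ssh accepts the key
chmod 400 "$KEY"
stat -c %a "$KEY"   # prints: 400
# Then connect, entering the HotCRP-provided password when prompted:
# ssh -i "$KEY" fractal@<control-node>
```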

Important

We have reserved a 4-node cluster and a 30-node cluster on CloudLab from August 1st to August 24th for artifact evaluation. Reviewers should coordinate with each other to not run experiments at the same time—i.e., use HotCRP comments to notify each other of "locking" the infrastructure until a certain date (ideally, no more than a day). Please start evaluation early (in the background), as this kind of resource locking will delay the artifact evaluation process!

[Update on August 24th] We hope everything is going smoothly so far. As mentioned earlier, our CloudLab reservations run through August 24th. However, we will continue doing our best to allocate machines for reviewers. Depending on resource availability, we will periodically update the cluster IPs in the document. Please keep us posted on HotCRP about how things are going—we are always here to help!

Artifact Available (~5mins)

Confirm Fractal is publicly available on GitHub. Below are some relevant links:

  1. Fractal is openly hosted on GitHub (repo), including benchmarks, evaluation scripts, and frac.
  2. The Fractal repo is also permanently hosted on Zenodo, as an additional level of archival assurance.
  3. All data used in these experiments are publicly available (see URLs in this README's Appendix) as part of the Koala benchmark suite (USENIX ATC'25 paper, website: https://kben.sh/, full inputs).
  4. Fractal's command annotations conform to the ones from PaSh, another MIT-licensed open-source software.
  5. We have a publicly accessible Discord server (invite) for troubleshooting and feedback.

We note that Fractal is MIT-licensed open-source software, part of the PaSh project and hosted by the Linux Foundation.

Artifact Functional (~10mins)

Confirm sufficient documentation, key components as described in the paper, and the system's exercisability:

Documentation: Fractal contains documentation of its top-level structure (e.g., overall architecture, control-plane, runtime), its key components (e.g., remote pipes, DFS reader, runtime helpers), its setup and evaluation (e.g., cluster bootstrap, evaluation), its modularized fault injection subsystem, frac, and other elements (e.g., contribution, community).

Completeness: The repository's top-level README file offers a high-level overview. In more detail, to support fault-tolerant execution, Fractal:

  1. extends the dataflow compilation with subgraphs and wrappers, and the worker manager with subgraph-to-node mapping, dependency tracking, and selective re-execution (§4.1–§4.2); it also introduces a runtime datastream wrapper that decides whether to spill a stream to disk (via the --ft dynamic flag and a -s (singular) tag in each RemotePipe), re-executes only non-persisted outputs, and polls HDFS via JMX callbacks wired into the scheduler;
  2. optimizes execution through an event-driven worker runtime (§5.1), whose lock-free EventLoop launches up to N subgraphs and whose TimeRecorder logs execution; buffered-IO sentinel stripping (§5.2), where 8-byte EOF tokens are removed on the fly using a single 4096-byte buffer; and batched scheduling (§5.3), where the worker manager builds worker_to_batches and issues one Batch-Exec-Graph RPC per worker;
  3. introduces a fault-injection component supported by helpers that terminate entire process trees, with evaluation scripts driving these hooks to reproduce the fault-tolerance experiments of §6. The automated fault injection subsystem is modularized under frac, which contains two minimal examples. Together these files (and the PaSh-JIT submodule they build upon) cover every component shown in Fig. 3, demonstrating that the released code fully realizes the design presented in the paper.

Exercisability: (1) Scripts and data: Scripts to run experiments are provided in the ./evaluation directory: evaluation/run_all.sh runs all benchmarks, and the run.sh in each benchmark folder (e.g., the one in classics) runs individual benchmarks. Input data are downloadable via inputs.sh, which fetches datasets from persistent storage hosted at Brown University (see the Appendix). (2) Execution: To facilitate evaluation, we pre-allocate and initialize a 4-node and a 30-node cluster with all input data pre-downloaded, available via the fractal account (see HotCRP for passwords). To connect to the control node of each cluster:

# Connect to the 4-node cluster
ssh -i fractal.pem fractal@ms1044.utah.cloudlab.us
# Connect to the 30-node cluster
ssh -i fractal.pem fractal@ms1029.utah.cloudlab.us

To connect to the client container:

# Run the interactive shell inside the client container
sudo docker exec -it docker-hadoop-client-1 bash

To run Fractal with a minimal echo example under a fault-free setting:

$FRACTAL_TOP/pash/pa.sh --distributed_exec -c "echo Hello World!" 

Results Reproducible (~65mins)

The key result in this paper’s evaluation is that Fractal provides correct and efficient recovery for both regular-node and merger-node failures. This is demonstrated by its performance compared to fault-free conditions (§6.2, Fig. 7). These results were produced using the --full inputs of the Koala benchmarks; to accelerate artifact evaluation, in this section we will be using the --small inputs.

Terminology correspondence: Here is the correspondence of flag names between the paper and the artifact:

  • Fractal (no fault): --width 8 --r_split --distributed_exec --ft dynamic
  • Fractal (regular-node fault): --width 8 --r_split --distributed_exec --ft dynamic --kill regular
  • Fractal (merger-node fault): --width 8 --r_split --distributed_exec --ft dynamic --kill merger
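The flag sets above can be composed mechanically (the `mode`/`flags` wrapper variables are ours for illustration, not part of Fractal):

```shell
# Compose the fault-mode flags from the correspondence above
mode=regular   # one of: "" (no fault), regular, merger
flags="--width 8 --r_split --distributed_exec --ft dynamic"
[ -n "$mode" ] && flags="$flags --kill $mode"
echo "$flags"
# prints: --width 8 --r_split --distributed_exec --ft dynamic --kill regular
# usage: $FRACTAL_TOP/pash/pa.sh $flags -c "echo Hello World!"
```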

Execution and plotting: This section provides detailed instructions on how to replicate Fig. 7 of Fractal's experimental evaluation, covering the benchmark sets described in Table 2: Classics, Unix50, NLP, Analytics, and Automation.

Important

For this step, we recommend using screen or tmux to avoid accidental disconnect. The plot requires data from both clusters. We recommend starting experiments on both clusters simultaneously, as each takes ~65mins to complete.

To run all the benchmarks from the control node of both clusters:

# open the interactive shell for the client container
sudo docker exec -it docker-hadoop-client-1 bash

# enter the eval folder
cd $FRACTAL_TOP/evaluation

# cleanup previous results 
bash cleanup_all.sh

# There are two options here: use either --small or --full as an argument to determine the input size.
# To facilitate the review process, we populate the data using `bash inputs_all.sh --small` (~20 minutes)
# Optionally, reviewers can run `bash inputs_all.sh` to clean up and regenerate all data from scratch.
bash run_faulty.sh --small

Generating the plots requires data from both clusters. To parse the per-cluster results, run the following command with --site 4 for the 4-node cluster:

# Parse results for a single cluster
./plotting/scripts/parse.sh --site 4

Or --site 30 for the 30-node cluster:

./plotting/scripts/parse.sh --site 30

After parsing results from both clusters, run the following command from one of the control nodes to generate the final figures by aggregating the results:

# Generate the plots
./plotting/scripts/plot.sh ms1044.utah.cloudlab.us ms1029.utah.cloudlab.us

Once the script completes, follow its prompt and open the following URLs in a browser to view the generated figures, for example:

Fig. 7: http://<node>.cloudlab.us/fig7.pdf

Example output generated from the artifact:

example-output

Optional: Additional Experiments (multiple days)

Three additional experiments confirm other results presented in the paper—these results are secondary to the key thesis, require significant additional time, and depend on third-party software artifacts that take time and effort to set up:

  • Fault-free performance: Fractal achieves performance that rivals that of state-of-the-art systems (§6.1, Fig. 4 and Fig. 5). Confirming this result requires running other software artifacts, some of which are tricky to set up and run, including the DiSh research prototype and mostly-unmaintained Hadoop Streaming.
  • Dynamic output persistence: Fractal strikes a subtle balance between accelerated fault recovery and low overhead during fault-free execution (§6.3, Fig. 8). This is shown using microbenchmarks, but the earlier result confirms the best possible configuration for each experiment.
  • Hard faults: The paper also includes experiments with full machine shutdowns (literally bringing down the entire CloudLab node, not just the Fractal process tree); this requires significant time and effort, for results that are mostly identical to the ones confirmed earlier.

Fault-free execution (3.5 hours)

Fractal also delivers near state-of-the-art performance in failure-free executions compared to DiSh and Apache Hadoop Streaming (§6.1, Fig. 4 and Fig. 5).

To run all the benchmarks with --small input from the control node for each cluster:

# open the interactive shell for the client container
sudo docker exec -it docker-hadoop-client-1 bash

# enter the eval folder
cd $FRACTAL_TOP/evaluation

# run faultless
bash run_faultless.sh --small

Generating the plots requires data from both clusters. To parse the per-cluster results, run the following command with --site 4 for the 4-node cluster or --site 30 for the 30-node cluster:

# parse results for a single cluster
./plotting/scripts/parse.sh --site 4
# or --site 30 if on the 30-node cluster
# ./plotting/scripts/parse.sh --site 30

After parsing results from both clusters, run the following command on any control node to generate the final figures by aggregating the results:

# generate the plots
./plotting/scripts/plot.sh ms1044.utah.cloudlab.us ms1029.utah.cloudlab.us

Once the script completes, follow its prompt and open the following URLs in a browser to view the generated figures, for example:

Fig. 4: http://<node>.cloudlab.us/fig4.pdf
Fig. 5: http://<node>.cloudlab.us/fig5.pdf

Dynamic output persistence (30 mins)

Fractal disables output persistence for singular subgraphs and selectively enables it for others based on cost heuristics.
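A minimal sketch of such a cost-based decision is shown below. The heuristic, function name, and parameters are our assumptions for illustration, not Fractal's actual code: never persist singular subgraphs, and persist others only when spilling the output is cheaper than recomputing it after a fault.

```shell
# args: singular(0/1) output_bytes recompute_ms [disk_bw_bytes_per_ms]
should_persist() {
  local singular=$1 bytes=$2 recompute_ms=$3 bw=${4:-200000}
  # singular subgraphs are never persisted
  if [ "$singular" -eq 1 ]; then echo no; return; fi
  # estimated time to spill the output to disk
  local spill_ms=$(( bytes / bw ))
  if [ "$spill_ms" -lt "$recompute_ms" ]; then echo yes; else echo no; fi
}
should_persist 0 1000000 5000   # prints: yes (5ms spill vs 5000ms recompute)
should_persist 1 1000000 5000   # prints: no  (singular)
```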

To run the microbenchmark for dynamic persistence with --small input from one of the control nodes (e.g., the 4-node cluster's):

# open the interactive shell for the client container
sudo docker exec -it docker-hadoop-client-1 bash

# enter the eval folder
cd $FRACTAL_TOP/evaluation

# There are two options here: use either --small or --full as an argument to determine the input size.
# To facilitate the review process, we populate the data using `bash inputs_all.sh --small` (~20 minutes)
# Optionally, reviewers can run `bash inputs_all.sh` to clean up and regenerate all data from scratch.
bash run_microbench.sh --small

To parse the results, run the following command with --site 4 (the microbenchmark runs only on the 4-node cluster):

# Parse results for a single cluster
./plotting/scripts/parse.sh --site 4

After parsing the results, run the following command on any control node to generate the final figures:

# Generate the plots
./plotting/scripts/plot.sh ms1044.utah.cloudlab.us ms1029.utah.cloudlab.us

Once the script completes, follow its prompt and open the following URLs in a browser to view the generated figures, for example:

Fig. 8: http://ms1044.utah.cloudlab.us/fig8.pdf

Hard faults (manual efforts)

Optionally, you may try to introduce hard faults. However, despite its conceptual simplicity, introducing and monitoring hard faults requires significant time and effort.

As shown at the bottom of page 10, replicating the presented hard-faults experiment involves 3 completion percentages × 3 system configurations (AHS, regular, merger) × 2 failure modes × 5 repetitions × 3 benchmarks = 270 experiments, which took about a week of manual effort.
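The quoted total can be sanity-checked with shell arithmetic:

```shell
# 3 completion percentages x 3 configs x 2 failure modes x 5 repetitions x 3 benchmarks
echo $(( 3 * 3 * 2 * 5 * 3 ))   # prints: 270
```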

The procedure is listed below (as a running example, we set the experiment configuration to classics/top-n.sh with a merger fault injected at 50% completion):

  1. Prerequisites: set up a cloud deployment for Fractal.
  2. Follow the Exercisability section of the instruction file to enter the interactive shell for the client node.
  3. Set up the benchmark input: cd $FRACTAL_TOP/evaluation/classics; ./inputs.sh
  4. To simplify the experiment, comment out lines 23–32 (except line 25) in the run.sh file so that only top-n runs.
  5. Run the fault-free execution to record the fault-free time: ./run.sh
  6. Collect the IP addresses of all other remote nodes' datanode containers. One simple way is to run hostname -i on each.
  7. When the fault-free run is complete, run ./run.sh again and start a timer on the side.
  8. During scheduling, the worker manager stores the IP of the machine that will run the merger subgraph in a small helper file: $PASH_TOP/compiler/dspash/hard_kill_ip_path.log contains two lines:
  • Line 1 → IP (or hostname) of the merger node.
  • Line 2 → IP of one regular (non-merger) node.
  9. When the timer reaches 0.5 × {fault-free time}, shut down the remote node corresponding to the merger node's IP. On a CloudLab deployment, one way to do so is through CloudLab's web console: select the corresponding node and choose the "terminate" option for a non-graceful shutdown.
  10. When Fractal detects the fault, recovers, and eventually completes the run (in the client's container), reboot the just-shutdown node and wait until it is back up.
  11. To make sure it is back and stable, check whether all of its data blocks are back online (i.e., whether the replication factor is satisfied).
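For the final replication check, one option is to inspect `hdfs fsck` output. The helper below is a sketch (the `is_healthy` name is ours, but `hdfs fsck /` and its "Status: HEALTHY" summary line are standard HDFS tooling):

```shell
# Sketch: gate the next run on HDFS reporting all blocks healthy again
is_healthy() { grep -q "Status: HEALTHY"; }
# On the control node you would run:
#   hdfs fsck / | is_healthy && echo "safe to start the next run"
# Offline demonstration with canned fsck output:
echo "Status: HEALTHY" | is_healthy && echo "safe to start the next run"
# prints: safe to start the next run
```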

Appendix: Input locations

The Fractal project uses some of the Koala benchmarks (USENIX ATC'25 paper, website, full inputs), and thus uses some of the inputs permanently stored by the Koala authors: 1M, dictionary, books, Bible, Exodus, Gutenberg, PCAP, nginx logs, wav file, small jpg files, full jpg files, unix50 inputs, COVID small, COVID full, NOAA data.