WAYBACK-MACHINE-DOWNLOADER-COMPANION

Python 3 scripts that complement the output of hartator/wayback-machine-downloader.

I wrote the following scripts at the request of one of my dearest friends. For a bit of context, an (arguably) old website was decommissioned fairly recently (09/2023 ~ 10/2023), and all of its content (as it turned out, 'most' is the better word here) can now only be found on the Internet Archive's Wayback Machine. This is a good thing because the content still exists (mostly), but somewhat worrying because we don't know for how long, and pretty annoying because browsing through the Wayback Machine is slow (some requests take 5 seconds or more to answer).

With wayback-machine-downloader alone, I was able to download what turned out to be about 7% of the whole website. Since wayback-machine-downloader can download a single URL, I wrote the following scripts to find all the URLs that were missing 'locally' and feed them to wayback-machine-downloader, which then downloads the missing files. (I had to repeat this process a couple of times.)
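To illustrate the idea of finding 'locally' missing URLs, here is a minimal sketch. It is not the code of find_missing_ressource.py: the names LinkCollector and find_missing_urls are hypothetical, and it assumes the downloaded pages are .html files whose href and src attributes point at the other resources of the site.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the values of href and src attributes from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def find_missing_urls(site_root, base_url):
    """Return the same-site URLs linked from local HTML files that have no local copy."""
    site_root = Path(site_root)
    host = urlparse(base_url).netloc
    missing = set()
    for page in site_root.rglob("*.html"):
        collector = LinkCollector()
        # errors="replace" keeps the scan going on pages that are not valid UTF-8.
        collector.feed(page.read_text(encoding="utf-8", errors="replace"))
        # URL this page would have had on the original site.
        page_url = urljoin(base_url, page.relative_to(site_root).as_posix())
        for link in collector.links:
            absolute = urljoin(page_url, link)
            parsed = urlparse(absolute)
            if parsed.scheme not in ("http", "https") or parsed.netloc != host:
                continue  # skip mailto: links, external sites, and so on
            relative = parsed.path.lstrip("/") or "index.html"
            if relative.endswith("/"):
                relative += "index.html"  # assumed name for directory pages
            if not (site_root / relative).exists():
                missing.add(absolute.split("#")[0])
    return sorted(missing)
```

Each URL returned by find_missing_urls("websites/example.com", "http://example.com/") could then be passed to wayback-machine-downloader one at a time.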

VERSIONS

0.1.0-alpha: First release

INSTALL GUIDE

Install the wayback-machine-downloader software. You will need RubyGems.

Install Python 3 if you do not already have it. No other dependencies are needed: I wanted to use only the Python 3 standard library for this small project.

START GUIDE

Don't forget to adjust the variables WEB_FOLDER and WEB_OUTPUT in the config file config.json according to your needs.
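As a quick sanity check before running the scripts, the config file can be validated with a few lines of Python. This is only a sketch: it assumes config.json is a flat JSON object that holds both variables, and load_config is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path

# The two variables named in this guide; their values depend on your setup.
REQUIRED_KEYS = ("WEB_FOLDER", "WEB_OUTPUT")


def load_config(path="config.json"):
    """Load the config file and fail early if a required variable is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config file is missing: {', '.join(missing)}")
    return config
```

Calling load_config() from the folder that contains config.json either returns the settings as a dictionary or raises an error naming the missing variable.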

Rename the current folder Wayback-Machine-Downloader-Companion to websites.
Open a terminal / command prompt, go to the parent directory of the freshly renamed websites folder, and run wayback-machine-downloader.

CASE 01

Assuming you ran wayback-machine-downloader with the following basic command line in the parent directory:

wayback_machine_downloader http://example.com

Run the following commands repeatedly until nothing more is found to download:

python3 find_missing_ressource.py

python3 download_missing_ressource.py
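If you would rather not rerun the two commands by hand, the repetition can be automated. The sketch below makes an assumption the scripts themselves do not confirm: that "nothing else found" can be detected by the number of files in the download folder no longer growing. The names count_files, run_scripts and repeat_until_stable are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def count_files(folder):
    """Count every regular file below folder."""
    return sum(1 for path in Path(folder).rglob("*") if path.is_file())


def run_scripts():
    """Run one find + download round with the current Python interpreter."""
    for script in ("find_missing_ressource.py", "download_missing_ressource.py"):
        subprocess.run([sys.executable, script], check=True)


def repeat_until_stable(folder, run_round=run_scripts, max_rounds=20):
    """Run rounds until a round adds no new file; return the number of rounds run."""
    before = count_files(folder)
    for round_number in range(1, max_rounds + 1):
        run_round()
        after = count_files(folder)
        if after == before:
            return round_number  # assumed stop condition: nothing new was downloaded
        before = after
    return max_rounds
```

The max_rounds limit is there so that a site with many dead links cannot keep the loop running forever.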

CASE 02

If you ran wayback-machine-downloader with the following command line in the parent directory:

wayback_machine_downloader http://example.com -s

Run the following command, then proceed to CASE 01:

python3 merge_snapshots.py
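To give an idea of what merging snapshots involves, here is a minimal sketch. It is not the code of merge_snapshots.py: it assumes that the -s option leaves one sub-folder per snapshot, named after its timestamp (all digits, same length), and that the newest copy of each file should win. The function name merge_snapshot_folders is hypothetical.

```python
import shutil
from pathlib import Path


def merge_snapshot_folders(snapshots_root, merged_root):
    """Copy every snapshot folder into merged_root, oldest first; return files copied."""
    snapshots_root = Path(snapshots_root)
    merged_root = Path(merged_root)
    # Timestamp names of equal length sort chronologically when sorted as text.
    snapshot_dirs = sorted(
        (d for d in snapshots_root.iterdir() if d.is_dir() and d.name.isdigit()),
        key=lambda d: d.name,
    )
    copied = 0
    for snapshot in snapshot_dirs:
        for source in snapshot.rglob("*"):
            if not source.is_file():
                continue
            target = merged_root / source.relative_to(snapshot)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)  # a newer snapshot overwrites an older copy
            copied += 1
    return copied
```

Copying oldest first means a file that only exists in an old snapshot is still kept, while files present in several snapshots end up in their most recent version.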

WARNINGS

Some websites are fairly old, and their 'textual' content may not have been saved with UTF-8 encoding. As a result, you may find some strange characters in your files, or get errors from my scripts.
These scripts 'just worked' for me (albeit with a few tweaks here and there), so they may not be 100% tailored to your own needs.
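Regarding the encoding warning, one possible tweak is to try several encodings when reading a file. The sketch below is an assumption about what might help, not something the scripts currently do; read_text_with_fallback is a hypothetical helper, and cp1252 is only a guess at a common encoding for older Western European pages.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying each encoding in turn."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort never fails.
    return data.decode("latin-1"), "latin-1"
```

The returned encoding name makes it possible to log which files were not UTF-8, so they can be checked by eye for strange characters.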

Best of luck, and I hope this helped.
