I switched to restic (https://restic.net/) and the backrest webui (https://github.com/garethgeorge/backrest) for Windows support. Files are deduplicated across machines with good compression support.
One big advantage of using restic is that its append-only storage actually works, unlike Borg's, where it is just a hack.
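For context, with restic's rest-server backend the append-only behaviour is a single server-side flag; a minimal sketch (path and listen address are placeholders):

    rest-server --path /srv/restic --listen :8000 --append-only

Clients can still create new snapshots against that endpoint, but can no longer delete or overwrite existing data through it.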
I also use restic and do backups to append-only rest-servers in multiple locations.
I also back up multiple hosts to the same repository, which actually results in insane storage space savings. One thing I'm missing, though, is being able to specify multiple repositories for one snapshot so that I have consistency across the multiple backup locations. For now the snapshots just have different IDs.
> back up multiple hosts to the same repository
I haven't tried that recently (not in ~3 years): does that work with concurrency, or do you need to ensure only one backup is running at a time? Back when I tried it, I got the sense that it wasn't really meant to have many machines accessing the repo at once, and decided it was probably worth wasting some space in exchange for potentially more robust backups, especially for my home use case where I only have a couple of machines to back up. But it'd be pretty cool if I could replace my main backup servers (using rsync --inplace and zfs snapshots) with restic and get deduplication.
It works. In general, multiple clients can back up to/restore from the same repository at the same time and do writes/reads in parallel. However, restic does have a concept of exclusive and non-exclusive locks and I would recommend reading the manual/reference section on locks. It has some smart logic to detect and clean up stale locks by itself.
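For reference, stale locks can also be inspected and cleared by hand; a quick sketch, assuming the repository and password environment variables are already set:

    restic list locks    # show the lock files currently in the repository
    restic unlock        # remove locks that restic considers stale

(restic unlock only drops stale locks; --remove-all would remove them unconditionally.)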
Locks are created e.g. when you want to forget/prune data or when doing a check. The way I handle this is that I use systemd timers for my backup jobs. Before running e.g. a check command, I use an Ansible ad-hoc command to pause the systemd units on all hosts and then wait until their running operations are done. After my modifications to the repos are done, I enable the units again.
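Roughly, that pause step is just an ad-hoc call to Ansible's systemd module; a sketch with made-up unit and inventory group names:

    ansible backup_clients -m ansible.builtin.systemd -a "name=restic-backup.timer state=stopped"
    # ... run check/prune/forget against the repositories ...
    ansible backup_clients -m ansible.builtin.systemd -a "name=restic-backup.timer state=started"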
Another tip is that you can create individual keys for your hosts for the same repository. Each host gets its own key, so a host compromise only exposes that key, which can then be revoked after the breach. And as I said, I use rest-servers in append-only mode, so a hacker can only "waste storage" in case of a breach. And I also back up to multiple different locations (sequentially), so if one backup location is compromised I could recover from another.
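The key handling itself is just restic's key subcommands; a sketch (the repository URL and key ID are invented):

    restic -r rest:https://backup.example.com/myrepo key add                # add a key for a new host
    restic -r rest:https://backup.example.com/myrepo key list               # find the compromised host's key ID
    restic -r rest:https://backup.example.com/myrepo key remove 7f3b2a19    # revoke it

Each key only wraps the repository's master key, so revoking one doesn't require re-encrypting any data.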
I don't back up the full hosts, mainly application data. I use tags to tag by application, backup type, etc. One pain point is, as I mentioned, that the snapshot IDs in the different repositories/locations are different. Also, because I back up sequentially, data may have already changed between writing to the different locations. But this is still better than syncing them with another tool as that would be bad in case one of the backup locations was compromised. The tag combinations help me deal with this issue.
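As an illustration of the tagging (paths and tag names here are just examples):

    restic backup /var/lib/myapp --tag myapp --tag daily
    restic snapshots --tag myapp,daily    # only snapshots carrying both tags

That's how the tag combinations let me line up corresponding snapshots across the different repositories even though their IDs differ.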
Restic really is an insanely powerful tool and can do almost everything other backup tools can!
The only major downside to me is that it is not available in library form to be used in a Go program. But that may change in the future.
Also, what would be even cooler for the multiple backup locations is if the encrypted data could be distributed using something like Shamir's secret sharing, where you'd need access to k of n backup locations to recreate the secret data. That would also mean that you wouldn't have to trust whatever provider you back up to (e.g. Amazon S3).
The issue with this is that if someone hacks one of the hosts, they now have access to the backups of all your other hosts, at least with Borg and the standard setup. Would be cool if I were wrong, though.
At least with restic that is not an issue. See my other comment here: https://news.ycombinator.com/item?id=44626515
Backups are append only and each host gets its own key, the keys can be individually revoked.
Edit: I have to correct myself. After further research, it seems that append-only != write-only. Thus you are correct in that a single host could possibly access/read data backed up by another host. I suppose it depends on use-case whether that is a problem.
Restic is far better in terms of both usability and packaging (borgmatic is pretty much a requirement for usability). I have used both extensively; you can argue that borg can just be scripted instead and is a lot more versatile, but I had a much better experience with restic in terms of set-and-forget. I am not scared that restic will break; with borg I was.
Also, I'm not sure why this was posted; did a new version get released or something?
> you can argue that borg can just be scripted
And that's what I did myself. Organically it grew to ~200 lines, but it sits in the background (I created a systemd unit for it, too) and does its job. I also use rclone to store the encrypted backups in an AWS S3 bucket.
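Stripped of the error handling and notifications, the core of it is something like this (paths, retention, and the rclone remote name are placeholders):

    #!/bin/sh
    set -eu
    # BORG_PASSPHRASE is assumed to come from the systemd unit's environment
    export BORG_REPO=/srv/backups/borg

    # create a deduplicated, compressed archive named after host and timestamp
    borg create --stats --compression zstd ::'{hostname}-{now}' /etc /home

    # thin out old archives and reclaim the space they referenced (borg >= 1.2)
    borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6
    borg compact

    # push the already-encrypted repository to the S3 bucket
    rclone sync "$BORG_REPO" s3:my-backup-bucket/borg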
I forget about it so much that sometimes I have to remind myself to check whether it still works (it does).

                 Original size    Compressed size    Deduplicated size
All archives:          2.20 TB            1.49 TB             52.97 GB
How is the performance for both?
Last time I used restic, a few years ago, it choked on a not-so-large data set with high memory usage. I read that Borg doesn't choke like that.
Depends on what you consider large; I looked at one of the machines (at random), and it backs up about two terabytes of data spread across about a million files. Most of them don't change day to day. I ran another backup, and restic rescanned them and created a snapshot in exactly 35 seconds, using ~800 MiB of RAM at peak and about 600 on average.
The files are on an HDD and the machine doesn't have a lot of RAM; given the high I/O wait times and low overall CPU load, I'm pretty sure the bottleneck is loading filesystem metadata off the disk.
I wouldn't back up billions of files or petabytes of data with either restic or borg; stick to ZFS for anything at that scale.
I don't remember what the initial scan time was (it was many years ago), but it wasn't unreasonable — pretty sure the bottleneck also was in disk I/O.
Pika backup is pretty darn simple.
I once met the Borg author at a conference, pretty chill guy. He said that when people file bugs because of data corruption, it's because his tool found the underlying disk to be broken. Sounds quite reliable although I'm mostly fine with tar...
I used to work on backup software. I lost count of the number of times this happened to us with our clients, too.
I used CrashPlan in 2014. Back then, their implementation of Windows's Volume Shadow Copy Service (VSS) was buggy, and I lost data because of that. I doubt my underlying disk was broken.
While "hardware issue, not my fault, not my problem" is a valid stance, I'm thinking that if you hear it again and again from your users, maybe you should consider whether you can do more. Verifying that the file was written correctly is low-hanging fruit. Other possibilities are running a S.M.A.R.T. check and showing a warning, or adding redundancy to recover from partial failures.
I think the failure mode users/devs are hitting here is bit rot. It's not that the device won't report back the same bytes right after the write (even if you disable whatever caching is happening); it's that after some amount of time it will report the wrong bytes. Some file systems run "scrubs" to automatically find these errors and sometimes attempt to repair them (ZFS can do this).
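On ZFS that boils down to something like (pool name is a placeholder):

    zpool scrub tank     # re-read every block and verify it against its checksum
    zpool status tank    # shows scrub progress and any repaired or unrecoverable errors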
Love borg, use it to back up all my servers and laptop to a Hetzner Storage Box. Always impressed with the deduplication stats!
Same story here, using Borg with a Hetzner storage box to give me offsite backups.
Cheap, reliable, and almost trouble-free.
Also: Baqpaq
"Baqpaq takes snapshots of files and folders on your system, and syncs them to another machine, or uploads it to your Google Drive or Dropbox account. Set up any schedule you prefer and Baqpaq will create, prune, sync, and upload snapshots at the scheduled time.
"Baqpaq is a tool for personal data backups on Linux systems. Powered by BorgBackup, RSync, and RClone it is designed to run on Linux distributions based on Debian, Ubuntu, Fedora, and Arch Linux."
At: https://store.teejeetech.com/product/baqpaq/
Though personally I use Borg, Rsync, and some scripts I wrote based on Tar.
Kopia is an awesome tool that checks the same boxes, and has a wonderful GUI if you need that.
Not affiliated, just a happy user.
I use it via the Vorta (https://vorta.borgbase.com) frontend. My favorite backup solution so far.
Pika Backup (https://apps.gnome.org/PikaBackup/) pointed at https://borgbase.com is my choice.
Last time I checked, the deduplication only works per host when backups are encrypted, which makes sense. Anyway, borg is one of the three backup systems I use; it's alright.
Which are the others?
https://kopia.io/
backuppc and a shell script using rsync, for backups to usb sticks
They are also a prominent user of AES-OCB, IIRC.
I've been looking at this project occasionally for more than four years. Development of version 2.0 started sometime in April 2022 (IIRC) and there's still no release candidate yet. I'm guessing it'll be finished a year from now.
What are the current recommendations here to do periodic backups of a NAS with lower (not lowest) costs for about 1 TB of data (mostly personal photos and videos), ease of use and robustness that one can depend on (I know this sounds like a “pick two” situation)? I also want the backup to be completely private.
You definitely should have checksumming in some form, even if compression and deduplication are worthless in this particular use case, so either use ZFS on both the sending and the receiving side (most efficient, but probably will force you to redo the NAS), or stick to restic.
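If you go the ZFS route, the replication side is basically incremental snapshot send/receive; a rough sketch with invented dataset and host names:

    zfs snapshot tank/photos@2024-06-01
    zfs send -i tank/photos@2024-05-01 tank/photos@2024-06-01 | ssh backup-host zfs receive -u backup/photos

The receiving side then has the same checksummed blocks and can run its own scrubs.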
I've been mostly using restic over the past five years to backup two dozen servers + several desktops (one of them Windows), no problems so far, and it's been very stable in both senses of the word (absence of bugs & unchanging API — both "technical" and "user-facing").
https://github.com/restic/restic
The important thing is to run periodic scrubs with a full data read to check that your data can actually be restored (I do it once a week; once a month is probably the upper limit):

    restic check --read-data ...
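If a weekly full read turns out to be too heavy, restic can also verify a random subset of the data per run, e.g.:

    restic check --read-data-subset=10%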
Some suggestions for the receiver unless you want to go for your own hardware:

https://www.rsync.net/signup/order.html?code=experts
https://www.borgbase.com
(the code is NOT a referral, it's their own internal thingy that cuts the price in half)
People like to recommend restic but I stay with Borg because it is old, popular and battle tested. Very important when dealing with backing up data!
Restic is hardly new and untested? I don't think they're dissimilar in age. Restic is certainly battle tested. Are you thinking of rustic?
I've been using it for ~10 years at work and at home. Fantastic software.
I remember using Borg Backup before eventually switching to Duplicati. It's been a while.
I currently use borg, and have never heard of Duplicati. What made you switch?
I've had an awful experience with Duplicati. Unstable, incomplete, hell to install natively on Linux. This was 5 years ago and development in Duplicati seemed slow back then. Not sure how the situation is now.
Interesting to hear. I use Duplicati on Windows and it's been fine, though I haven't extensively used its features.
I'll die on this hill... If my files that are named like this:

DSC009847.JPG

were actually named like this:

DSC009847-b3-73ea2364d158.JPG

where "-b3-" means "what's coming before the extension are the first x bits (choose as many hexdigits as you want) of the Blake3 cryptographic hash of the file"...

We'd be living in a better world.
I do that for many of my files. Notably family pictures and family movies, but also .iso files, tar/gzip'ed files, etc.
This makes detecting bitflips trivial.
I've created little shell scripts for verification, backups, etc. that work with files having such a naming scheme.
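A minimal sketch of that kind of script, using the b3sum CLI (the 12-hexdigit length matches the example above; everything else is illustrative):

    # rename: embed the first 12 hex digits of the Blake3 hash before the extension
    f=DSC009847.JPG
    h=$(b3sum --no-names "$f" | cut -c1-12)
    g="${f%.*}-b3-${h}.${f##*.}"
    mv "$f" "$g"

    # verify later: recompute the hash and compare it with the one embedded in the name
    want=${g##*-b3-}; want=${want%.*}
    got=$(b3sum --no-names "$g" | cut -c1-12)
    [ "$got" = "$want" ] || echo "possible bitflip: $g"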
It's bliss.
My world is a better place now. I moved to such a scheme after I had a series of 20 pictures from vacation with old friends that were corrupted (thankfully I had backups, but the concept of "determining which one is the correct file" programmatically is not that easy).
And, yes, it has detected one bitflip since I started using it.
I don't always verify all the checksums, but I've got a script that does random sampling: it picks x% of the files with such a naming scheme and verifies their checksums.
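The sampling itself can be as simple as (directory and sample size are arbitrary here):

    find ~/pictures -type f -name '*-b3-*' | shuf -n 200

with each picked file then going through the verification step sketched above.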
It's not incompatible with ZFS: I still run ZFS on my Proxmox server. It's not incompatible with restic/borg/etc. either.
This solves so many issues, including "How do you know your data is correct?" (answer: "Because I've already watched that family movie after the cryptographic hash was added to its name").
Not a panacea but doesn't hurt and it's really not much work.
It's an old idea and is also how some anime fansub groups prepare their releases: the filename of each episode contains the CRC32 of the file inside [square brackets].
It doesn't really make much sense for BitTorrent uploads (BitTorrent provides its own, much stronger hashes); it's a holdover from the era of IRC bots.