So there I was, editing my personal writing journal. I realized that the file had somehow lost a large chunk of data, and only had the last few entries. My backups had the same information, so I was staring at six months of data loss on an important personal file.
This post covers how I got the data back, good as new.
Is the disk bad?
The first thing I did was search the hard drive to see whether the text was lying around somewhere in another file, or in a leftover vim swap file. When you use vim (in my case, gvim), by default it creates a swap file alongside the file while you are editing it. Then, if the machine goes down mid-edit, you can restore the file from the autosaved swap file. Unfortunately, I didn’t find the data on my hard drive after searching around.
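For that kind of search, here is a minimal Python sketch that lists files whose names look like vim swap files. The function name `find_swap_files` and the name pattern are illustrative assumptions. The pattern covers the usual `.name.swp`, `.swo`, `.swn` style names, and it will not see swap files kept in a custom swap directory.

```python
import os
import re
from typing import List

# Vim usually names swap files like ".journal.txt.swp"; if that name is
# taken it falls back to ".swo", ".swn", and so on. (Assumed pattern.)
SWAP_NAME = re.compile(r"^\.(.*\.)?sw[a-z]$")


def find_swap_files(root: str) -> List[str]:
    """Return the paths of all files under root that look like vim swap files."""
    matches = []
    # os.walk silently skips directories it cannot read.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if SWAP_NAME.match(name):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

This only finds swap files that still exist in the file system. Deleted ones have to be recovered from the raw disk.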
The disk itself didn’t seem to be having any problems. There were no audible signs of failure, and no other files were missing to the best of my knowledge. However, I figured that I should run a disk check to make sure that I wasn’t dealing with the early stages of more widespread data loss. Tools like fsck need to be run while the file system is not mounted, so I needed a way to check the drive before it was mounted. Since I was running Ubuntu, I found a helpful command:
sudo touch /forcefsck
This tells Ubuntu to run fsck on the next startup, before the file system is mounted. I ran this and rebooted, and afterwards the file system appeared to be the same.
So how did I lose data?
The file is my 2011 journal, and I really wanted to get it back. My best guess as to how this happened was:
I somehow delete most of the file
I unwittingly save the file, and go to bed for the night
My mirrored backup runs, syncing the bad version of the file to my shared server
I look at the file and see it’s corrupt, and that the backup is corrupt as well
An alternate version includes some sort of corruption or power failure at a critical time that did something nasty.
I knew that the nightly mirroring operation had a failure mode: if I deleted or modified a file and later wanted the previous contents, the mirror would already have overwritten them. Still, the effort of setting up a more complicated backup had not seemed worth it. I just wanted something to protect against catastrophic data loss (fire, hard disk crash) or to handle minor goofs. In retrospect, I probably spent as much time restoring the file as I would have spent coming up with a better backup solution.
Knowing a bit about how file systems and hard drives work, I reasoned that the portion of the file that I deleted might still be around somewhere in an unindexed area of the hard drive. Basically the file system keeps a list of pointers to blocks, and if one of those pointers got messed up, or an old version of the file was lying around, I might be able to get at least some of the file back. Plus, I had some small snippets still in Gmail, so it wouldn’t be a complete wash either way.
If it was the case that there was recoverable data on the hard drive, I didn’t want to write to it in case I overwrote that data. Hence, I needed to search the hard drive, preferably without mounting it.
Scalpel
I messed around with USB installations of Ubuntu, but this was a rabbit hole. I ended up burning an Ubuntu Live CD, and I must say, it was fantastic working with it.
From here, the tool I primarily used to recover the data was Scalpel, a file carving tool that scans the raw disk for byte patterns that delimit the start (and optionally, the end) of what you are looking for. This lets you go through the entire disk, not just the parts the operating system thinks contain files.
Scalpel operates in two passes. First, it looks at the whole hard drive for any of the start markers. It remembers these, and then on the second pass it just goes to the start markers and takes the lesser of:
the number of bytes that you say to stop at, OR
an ending delimiter if you provided one
It has some clever backtracking in case you know the end of the file and not the beginning, but this is not the place for explanation of that (mostly because I didn’t use it myself.) See the documentation.
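To make the two-pass idea concrete, here is a minimal Python sketch of it. This is not Scalpel’s actual implementation. The function names `find_headers` and `carve` are made up for illustration, and the sketch works on any seekable binary stream (an in-memory buffer here, or a disk image opened read-only).

```python
import io
from typing import BinaryIO, List, Optional


def find_headers(stream: BinaryIO, header: bytes, chunk_size: int = 1 << 20) -> List[int]:
    """Pass 1: return the absolute offset of every occurrence of header."""
    if not header:
        raise ValueError("header must not be empty")
    offsets = []
    overlap = len(header) - 1  # bytes carried over so matches spanning chunks are seen
    tail = b""
    base = 0  # absolute offset of the first byte of the current window
    stream.seek(0)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        pos = window.find(header)
        while pos != -1:
            offsets.append(base + pos)
            pos = window.find(header, pos + 1)
        tail = window[-overlap:] if overlap else b""
        base += len(window) - len(tail)
    return offsets


def carve(stream: BinaryIO, offsets: List[int], max_size: int,
          footer: Optional[bytes] = None) -> List[bytes]:
    """Pass 2: at each offset take max_size bytes, or stop after the footer if it comes first."""
    fragments = []
    for start in offsets:
        stream.seek(start)
        data = stream.read(max_size)
        if footer:
            end = data.find(footer)
            if end != -1:
                data = data[:end + len(footer)]
        fragments.append(data)
    return fragments
```

Reading in chunks with a small overlap keeps memory use flat on a large disk while still catching a start marker that straddles two chunks.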
Anyway, I knew a line near the very beginning of the file, so figured this would be a good place to start. One nice thing about this file was that it had date/time stamps throughout it in the format “YYYYMMDD - HHmm” on lines by themselves. In this way, I could easily see what portions of the file that I had recovered and make sure that I didn’t miss anything.
Out of the box, Scalpel doesn’t search for anything, because every rule in the default configuration file is commented out. You need to specify what to look for in a configuration file. The global configuration file (fine for now) is at /etc/scalpel/scalpel.conf. The line below is one that was given as an example in the configuration file.
# txt y 100000 -----BEGIN\040PGP
I uncommented this line and let Scalpel do its work with:
$ sudo scalpel /dev/sda1
It printed a status bar, and after an hour or so dumped the results to a subdirectory of the current working directory.
I looked at the files it found, and they all started with “-----BEGIN”, so were basically PGP keys. This wasn’t what I wanted (I thought it would return me all text files.) It was a useful experiment though, as I was able to then refine the query to:
txt y 100000 This\sis\sthe\spersonal
The “txt” was not really important at all; it could just as well have been
foo y 100000 This\sis\sthe\spersonal
The first field is basically just the file extension you want to put on these fragments. Many of them would be marked as binary files, which was not a problem. The second field says whether the pattern is case sensitive (y or n). The third is the maximum number of bytes to carve out after the starting point. In recovery mode this should be about as big as you think the file is, so that you don’t run out of space for the output files. This happened to me once, and I needed to make the size smaller. No files were output, so it was a waste of an hour of Scalpel time.
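As an illustration of the rule format, here is a small Python sketch that builds such a configuration line from a plain search string. The helpers `encode_pattern` and `build_rule` are hypothetical and not part of Scalpel. The sketch assumes the `\s` escape for spaces used in the rule above, and assumes that `\xNN` hex escapes are accepted for other awkward bytes.

```python
from typing import Optional


def encode_pattern(text: str) -> str:
    """Escape a search string so it can sit in one whitespace-delimited field."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch == " ":
            out.append("\\s")  # same escape as in This\sis\sthe\spersonal
        elif ch == "\\" or not (0x21 <= byte <= 0x7E):
            out.append(f"\\x{byte:02x}")  # assumption: hex escapes are accepted
        else:
            out.append(ch)  # other printable ASCII is copied as-is
    return "".join(out)


def build_rule(extension: str, case_sensitive: bool, max_bytes: int,
               header: str, footer: Optional[str] = None) -> str:
    """Build one rule line: extension, case flag, size, header, optional footer."""
    fields = [extension, "y" if case_sensitive else "n", str(max_bytes),
              encode_pattern(header)]
    if footer:
        fields.append(encode_pattern(footer))
    return "\t".join(fields)


print(build_rule("txt", True, 100000, "This is the personal"))
```

Characters that the tool may treat specially, such as a wildcard character, are not handled here, so check the generated line against the comments in the configuration file before using it.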
I ran the command again, and it output a bunch of files. I then went through all of the files, found the ones with the most useful fragments, and compiled them by hand. Obviously this was made a lot easier by having time/date stamps all through the file. Another thing that made my recovery easier was the fact that Vim had written swap files all over the place, which I had typically deleted upon finding them. So they were still around on the hard disk, just not accessible. If I only had one version of the file, it might have been tougher, but still doable.
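The compiling was done by hand, but the date/time stamps also make it possible to automate part of the triage. Below is a rough Python sketch, assuming the “YYYYMMDD - HHmm” stamps sit on lines by themselves. The names `stamps_in` and `pick_fragments` are made up for illustration. It greedily picks the fragments that contribute the most stamps not yet seen.

```python
import re
from typing import Dict, List, Set

# A stamp such as "20110315 - 2130" on a line by itself.
STAMP = re.compile(rb"^(\d{8}) - (\d{4})\s*$", re.MULTILINE)


def stamps_in(fragment: bytes) -> Set[str]:
    """Return the set of date/time stamps found in one recovered fragment."""
    return {f"{d.decode()} - {t.decode()}" for d, t in STAMP.findall(fragment)}


def pick_fragments(fragments: Dict[str, bytes]) -> List[str]:
    """Greedily choose fragments, each time taking the one that adds the most new stamps."""
    remaining = {name: stamps_in(data) for name, data in fragments.items()}
    seen: Set[str] = set()
    chosen: List[str] = []
    while remaining:
        best = max(remaining, key=lambda n: len(remaining[n] - seen))
        if not (remaining[best] - seen):
            break  # nothing left adds a stamp we have not already covered
        chosen.append(best)
        seen |= remaining.pop(best)
    return chosen
```

This only narrows down which fragments to read. Stitching the entries back together in order is still a manual job.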
I made it look pretty simple here, but it took a while in aggregate due to the long running times, and therefore slow feedback cycle, of running Scalpel. One tip is that you can email yourself when the command completes successfully:
$ sudo scalpel /dev/sda1 && (echo "done" | mail -s "done" your@email.com)