This repository has been archived by the owner on Jul 21, 2023. It is now read-only.

Journal READ Corruption Issue #52

Open
doapp-nick opened this issue Jan 16, 2014 · 9 comments

Comments

@doapp-nick

I'm seeing unpredictable but consistent corruption of journal files.  The point of corruption is always a READ line in the journal.  I'm trying to wrap my head around how this could happen and wondering if anyone has suggestions.

I've created an object cache that wraps the DiskLruCache.  Here are my put and get methods:

public Object get(String key) throws IOException, ClassNotFoundException {
    Object result = null;
    String sanitizedKey = getDiskHashKey(key);
    if (cache != null) {
        DiskLruCache.Snapshot snapshot = cache.get(sanitizedKey);
        if (snapshot != null) {
            InputStream inputStream = snapshot.getInputStream(0);
            InputStream buffer = new BufferedInputStream(inputStream);
            ObjectInput input = new ObjectInputStream(buffer);
            result = input.readObject();
            snapshot.close();
        }
    }
    return result;
}

public void put(String key, Object value) throws IOException {
    String sanitizedKey = getDiskHashKey(key);
    DiskLruCache.Editor creator = cache.edit(sanitizedKey);
    if (creator != null) {
        try {
            ObjectOutputStream objOutStream = new ObjectOutputStream(
                    new BufferedOutputStream(creator.newOutputStream(0)));
            objOutStream.writeObject(value);
            objOutStream.close();
            creator.commit();
        } catch (Exception e) {
            e.printStackTrace();
            creator.abort();
        }
    }
}
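One detail worth noting about the wrapper: get() never closes the ObjectInputStream, and if readObject() throws, the snapshot is never closed either, which can leak file handles. A self-contained sketch of the same read/write pattern using try-with-resources; plain file streams stand in for DiskLruCache's snapshot and editor streams, so this is an illustration of the closing pattern, not a drop-in replacement:

```java
import java.io.*;

public class SerializeDemo {
    // Write an object, closing the stream even if writeObject() throws.
    static void writeObject(File file, Object value) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(value);
        } // close() runs here on both the success and the exception path
    }

    // Read an object back, again with guaranteed stream closure.
    static Object readObject(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("entry", null);
        writeObject(f, "hello");
        System.out.println(readObject(f)); // prints hello
    }
}
```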

My caches are on the order of 3,000 objects when I start seeing the behavior. This object cache is accessed by multiple threads, but it looks to me like the DiskLruCache itself is doing the necessary synchronization. I've seen the cache get wiped out several times, and I've managed to capture a cache in the state where it has an invalid entry:

READ 2013-reg-c8d10fe6-8c07-4b33-8d03-b7addf69038d
READ 2013-reg-191ff930-47e3-48b0-92f2-1da067e982d4
READ 2013-reg-b6ff7fd9-7da5-4f71-946f-e5d560060794
RREAD 2013-pst-72fe30af-b568-48cd-8582-bfbaa0e2ae6b
READ 2013-pst-5440007e-4f48-46c0-bec2-3ab4779a4c3e
READ 2013-pst-0117fbeb-a140-43ba-8dd2-4b7fc0f55211

In every case where the cache has become corrupted, the behavior is similar: part of a READ entry is overwritten by another READ. This isn't at the very end of the journal; it's about 85% of the way through.

@swankjesse
Collaborator

Which OS and filesystem? I'd like to try to reproduce in the same environment!

@doapp-nick
Author

OSX 10.9.1 - Mac OS Extended (Journaled)

java version "1.7.0_40"
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)

@doapp-nick
Author

Just hit it again after I had to kill the Java process.

READ 2013-reg-313-fa2d6b37-b7d5-4d5c-b3READ 2013-reg-338-50668ff9-c152-4ddb-888d-236f4321f147

I am having to 'restart' (basically kill the JVM) the server pretty regularly during development. Could that be contributing to this? Maybe something with the BufferedWriters and the termination sequence?

@swankjesse
Collaborator

In theory the cache should be able to recover from an incomplete journal.

  @Test public void openWithTruncatedLineDiscardsThatLine() throws Exception {
    cache.close();
    writeFile(getCleanFile("k1", 0), "A");
    writeFile(getCleanFile("k1", 1), "B");
    Writer writer = new FileWriter(journalFile);
    writer.write(MAGIC + "\n" + VERSION_1 + "\n100\n2\n\nCLEAN k1 1 1"); // no trailing newline
    writer.close();
    cache = DiskLruCache.open(cacheDir, appVersion, 2, Integer.MAX_VALUE);
    assertThat(cache.get("k1")).isNull();
  }

I suspect the problem is that we recover from the incomplete journal, but we don't delete the truncated line before starting to write new lines into the file?
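If that hypothesis holds, one fix would be to truncate the journal back to the last complete line before reopening it for append, so new entries never fuse onto a partial one. A self-contained sketch of that truncation step; this is not DiskLruCache's actual recovery code, just the idea:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class TruncateJournal {
    // Byte length of the journal with any trailing partial line dropped:
    // everything up to and including the last '\n'.
    static long lengthOfCompleteLines(byte[] journal) {
        for (int i = journal.length - 1; i >= 0; i--) {
            if (journal[i] == '\n') return i + 1;
        }
        return 0;
    }

    // Truncate the file so a later append starts on a clean line boundary.
    static void dropPartialLastLine(Path journal) throws IOException {
        long keep = lengthOfCompleteLines(Files.readAllBytes(journal));
        try (FileChannel ch = FileChannel.open(journal, StandardOpenOption.WRITE)) {
            ch.truncate(keep);
        }
    }

    public static void main(String[] args) throws IOException {
        Path journal = Files.createTempFile("journal", null);
        Files.write(journal, "READ a\nREAD b\nCLEAN k1 1 ".getBytes(StandardCharsets.UTF_8));
        dropPartialLastLine(journal);
        // Only the two complete READ lines remain; the partial CLEAN line is gone.
        System.out.print(new String(Files.readAllBytes(journal), StandardCharsets.UTF_8));
    }
}
```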

@doapp-nick
Author

Yeah, as it stands, it looks like any time an incomplete or junk line exists, the entire cache gets destroyed. I suppose this is because a cache of unknown state is not super useful, but in my use case, I'm not sure how to avoid this scenario.

@rupertbates

I'm seeing a variation on this issue which is happening on Android. The behaviour is the same in that there is a corrupted line in the journal and that is causing the cache to be deleted on opening. In my case the line looks like this:

READ 96644ef5b41c75bd081e1737bREAD e9211ead93e7e0c88e4f79b0d5af4a45
I.e. it has two keys on one line.

This causes the method readJournalLine() to throw when it encounters it, at which point the whole cache gets deleted. It would be much more useful to me to just drop the invalid line rather than deleting the whole cache (particularly if it is just a READ that is corrupted). Can anyone see any reason why this wouldn't work?
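A lenient reader along those lines might look like the following sketch. The opcodes match the journal format shown in this thread, but parseLeniently is a hypothetical helper, not part of DiskLruCache, and it does far less validation than the real readJournalLine() (real CLEAN lines also carry entry sizes):

```java
import java.util.*;

public class LenientJournal {
    private static final Set<String> OPS =
            new HashSet<>(Arrays.asList("CLEAN", "DIRTY", "READ", "REMOVE"));

    // Returns the keys from lines that parse; malformed lines are skipped
    // instead of invalidating the whole journal.
    static List<String> parseLeniently(List<String> lines) {
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            if (parts.length >= 2 && OPS.contains(parts[0])) {
                keys.add(parts[1]);   // well-formed: keep the entry
            }
            // otherwise (e.g. a fused "RREAD ..." line): drop just this
            // line and keep reading
        }
        return keys;
    }
}
```

The trade-off is that a skipped CLEAN or REMOVE line leaves the in-memory state slightly stale, but for a cache that is recoverable, whereas a dropped READ costs nothing but LRU ordering accuracy.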

@mikeklein

Late to the game here, but I am seeing the same issue.

My Android application will at some point delete my entire cache and start from scratch! This particular cache has no methods exposed to remove entries, so they should always be there. I appear to be hitting this "corrupt journal" issue, and my cache is getting nuked.

I'm unclear on what is causing this state or how to fix it.

@mikeklein

mikeklein commented Sep 9, 2020

I finally caught the issue occurring in the debugger. What's very odd is that if I pause the code before cache.delete, use the debugger to issue a call to rebuild the journal, and then restart, my cache comes up without issues, yet there's a "bad" line left in the journal. It's unclear why the journal is set to rebuild itself based on directory contents, or why nuking the whole cache is preferable to merely missing an entry.

I added a line of code to flush the cache after my writes, and this isn't solving the issue. I think the corruption occurred while my app was idle and I killed the task (I do this regularly during debugging, of course).

On first seeing corrupt journal and opening file I see:
READ popegregoryiii
READ popezachary
READ popeelectstephen
READ popestepheREAD curricula
READ authors
READ blackpower
READ conquerors
READ foundingfathers
READ greatwomen

After the rebuild of the journal (and successfully bringing up the cache) I see the following corruption still in the journal (the bad line moved to the end):

CLEAN popegregoryiii 135
CLEAN popezachary 127
CLEAN popeelectstephen 139
CLEAN popestepheREAD 0

Would appreciate any ideas. Like others I have multiple directories (for each cache) and there may be a few thousand files in each directory.

@Jeanno

Jeanno commented Apr 8, 2023

Looks like this is a rather old problem, and it still persists today. After taking a look at the code, I might have some idea of the root cause.

Assuming you don't have multiple instances of the DiskLRU on the same dir (across processes/threads), there's still a chance the journal file can get corrupted.

This is because the operations on the primary journal file are not atomic. For example, this line operates directly on the primary journal file:

journalWriter.append(READ + ' ' + key + '\n');

If the process crashes or gets killed while journalWriter is writing to the file, which is not uncommon, the journal file gets corrupted immediately. And when DiskLRU then detects the corruption in the journal, it removes the entire cache.
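One standard mitigation for non-atomic file updates is the write-to-temp-then-rename pattern, which DiskLruCache's periodic journal rebuild already resembles; the exposure Jeanno describes is in the plain appends between rebuilds. A minimal self-contained sketch of the pattern, with hypothetical names:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.List;

public class AtomicRewrite {
    // Write the replacement journal to a sibling temp file, then move it
    // over the old one in a single rename. A crash at any point leaves
    // either the old journal or the new one on disk, never a half-written
    // file. Per-line appends have no such guarantee.
    static void rewriteJournal(Path journal, List<String> lines) throws IOException {
        Path tmp = journal.resolveSibling(journal.getFileName() + ".tmp");
        Files.write(tmp, lines, StandardCharsets.UTF_8);
        Files.move(tmp, journal, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Making every READ append crash-safe this way would be prohibitively expensive, which is why tolerating (or truncating) a single bad trailing line on open is the more practical direction.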
