2.1 backport logrecovery #6395
…ata apache#4873 This commit makes two major changes. First, it changes log recovery to use block caches. Second, it checks whether a tablet has any data in walogs before acquiring the recovery lock. Together, these two changes significantly speed up loading tablets that have no data in walogs. These changes introduce an extra opening of the walogs to see if the recovery lock needs to be acquired. Using the block caches for this extra opening should avoid any extra cost. The block caches also help in the case where many tablets with the same walogs are assigned to a tablet server. Some simple tests showed an 8x speedup in tablet load times. Any time a tablet has an unclean shutdown, it will have the walogs of the dead tserver assigned to it even if it had no data in those walogs. These changes make loading tablets in that situation much faster. {"fundingSource": "41201", "team": "FED.ICGSA.OPS.MOE", "fshGit": "dummy-lo", "fshDocker": "sha256:20cf0045"}
In apache#4873 a check was added to inspect walogs during tablet load to see if they had any data for the tablet. This check happens prior to the volume replacement that also runs during tablet load. Therefore, if volume replacement is needed for the walogs, this check will fail because it cannot find the files, and the tablet will fail to load. To fix this problem, the new check was modified to switch volumes, if needed, before it runs. {"fundingSource": "41201", "team": "FED.ICGSA.OPS.MOE", "fshGit": "dummy-lo", "fshDocker": "sha256:20cf0045"}
… log recovery. (apache#4874) The log recovery code would list the sorted walog files multiple times during recovery. These changes modify the code to list the files only once. The listing is also cached for a short period of time to improve the case of multiple tablets referencing the same walogs. This, along with apache#4873, should result in much less traffic to the namenode when an entire Accumulo cluster shuts down and needs to recover. {"fundingSource": "41201", "team": "FED.ICGSA.OPS.MOE", "fshGit": "dummy-lo", "fshDocker": "sha256:20cf0045"}
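The short-lived caching of the sorted walog listing described in the commit above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code from this PR: the class name `ShortLivedListingCache`, the `lister` function, and the injected clock are all hypothetical, and it uses only the JDK rather than a caching library.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical sketch: remembers an expensive directory listing for a short time. */
final class ShortLivedListingCache {

  /** A cached listing plus the time it was loaded. */
  private static final class Entry {
    final List<String> files;
    final long loadedAtNanos;

    Entry(List<String> files, long loadedAtNanos) {
      this.files = files;
      this.loadedAtNanos = loadedAtNanos;
    }
  }

  private final ConcurrentHashMap<String, Entry> entries = new ConcurrentHashMap<>();
  private final Function<String, List<String>> lister; // assumed: performs the real listing
  private final long ttlNanos;
  private final LongSupplier clock; // injectable so expiry can be tested

  ShortLivedListingCache(Function<String, List<String>> lister, long ttlNanos,
      LongSupplier clock) {
    this.lister = lister;
    this.ttlNanos = ttlNanos;
    this.clock = clock;
  }

  /** Returns the cached listing for dir, reloading it once the entry is older than the TTL. */
  List<String> list(String dir) {
    long now = clock.getAsLong();
    // compute() is atomic per key, so concurrent callers asking for the same
    // directory wait for one listing instead of each issuing their own.
    Entry entry = entries.compute(dir, (key, old) -> {
      if (old != null && now - old.loadedAtNanos < ttlNanos) {
        return old; // still fresh: reuse
      }
      return new Entry(List.copyOf(lister.apply(key)), now);
    });
    return entry.files;
  }
}
```

In production use the clock would be `System::nanoTime` and the TTL a few seconds, so that tablets sharing the same walogs and loading at about the same time trigger one listing per directory instead of one per tablet.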
909d002 to 8333a39
I ran all the ITs on this branch. There are no tests failing on this branch that are not also failing on the 2.1 branch.
    return null;
  }

  public static LogEntry switchVolume(LogEntry le, List<Pair<Path,Path>> replacements) {
I think the better solution would be to just make the method below public.
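For readers unfamiliar with the volume replacement that `switchVolume` takes part in, the idea of rewriting a walog path before checking it for data can be sketched as follows. This is an illustration only, not Accumulo's implementation: it works on plain strings instead of `LogEntry` and `Pair<Path,Path>`, the class and method names are hypothetical, and returning the path unchanged when no replacement matches is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of switching a walog path to a replacement volume before checking it. */
final class VolumeSwitchSketch {

  /**
   * Rewrites path onto the first matching replacement volume. Each entry maps an
   * old volume prefix (key) to its new volume prefix (value). Assumption: a path
   * with no matching prefix is returned unchanged.
   */
  static String switchVolume(String path, List<Map.Entry<String, String>> replacements) {
    for (Map.Entry<String, String> replacement : replacements) {
      String oldVolume = replacement.getKey();
      // Require a path separator after the prefix so "/vol1" does not match "/vol10".
      if (path.startsWith(oldVolume + "/")) {
        return replacement.getValue() + path.substring(oldVolume.length());
      }
    }
    return path;
  }

  /** Resolves the volume first, then runs the data check against the resolved path. */
  static boolean walogHasData(String walogPath, List<Map.Entry<String, String>> replacements,
      Predicate<String> hasDataAt) {
    String resolved = switchVolume(walogPath, replacements);
    return hasDataAt.test(resolved);
  }
}
```

The ordering is the point of the sketch: the check runs against the path after replacement, so a walog that now lives on the new volume is found instead of causing the tablet load to fail.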
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.common.cache.Cache;
I think this should be com.github.benmanes.caffeine.cache.Cache in all new imports.
OK, I broke it up into two commits: one for RecoveryLogsIterator, which touched 15 files, and the other for the rest of the files using com.google.common.cache.Cache (9 src, 1 test, and the manager pom.xml).
I've squashed the last 3 commits.
If there is anything else that needs to be done, please let me know. This is my first time contributing to any open source project, and I apologize in advance.
9330562 to 95a2651
The build is failing because some of our formatting checks are failing. If you run
added @CanIgnoreReturnValue to Combiner sawDelete
95a2651 to 1384261
Thanks! Obviously didn't see in GitHub what the issue was. And the import alphabetize plugin is a pretty slick feature.
closes #4887
Saw this issue #4887 and thought I could help.
And using AI's help I got it to pass the unit tests.
These are the Integration Tests that got stuck or timed out.
org.apache.accumulo.test.fate.zookeeper.FateIT never returned, but ran it again and it passed
org.apache.accumulo.test.tracing.ScanTracingIT timed out twice
org.apache.accumulo.test.functional.MetadataMaxFilesIT timed out twice
org.apache.accumulo.test.functional.TimeoutIT timed out twice
org.apache.accumulo.test.functional.TServerShutdownOptimizationsIT timed out twice
org.apache.accumulo.test.functional.KerberosIT timed out twice
org.apache.accumulo.test.shell.ShellServerIT never returned, but ran it again and it passed twice
[ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 166.9 s <<< FAILURE! -- in org.apache.accumulo.test.functional.KerberosIT
[ERROR] org.apache.accumulo.test.functional.KerberosIT.testGetDelegationTokenDenied -- Time elapsed: 14.48 s <<< ERROR!
java.lang.IllegalStateException: org.apache.hadoop.security.KerberosAuthException: failure to login: using ticket cache file: FILE:/tmp/krb5cc_911602271_DwR0dF javax.security.auth.login.LoginException: java.lang.IllegalArgumentException: Illegal principal name ajmcdonald@CCRI.COM: org.apache.hadoop.security.authentication.util.KerberosName$NoMatchingRule: No rules applied to ajmcdonald@CCRI.COM
I would like to deploy it to one of our dev environments and do some ingest testing but haven't gotten to it yet.