New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-2170: Updated Fixes For Windows Platform #4431
Conversation
Special treatment is needed for the deletion and renaming of log and index files on Windows, due to the latter's general inability to perform those operations while a file is opened or memory mapped. The changes in this commit are essentially adapted from: apache#154 More detailed background information on the issues can also be found via that link.
This commit addresses an edge case with compaction and asynchronous deletion of log files initially encountered when debugging: LogCleanerTest.testRecoveryAfterCrash failures on Windows. It appears that troubles arise when compaction of logs results in two segments having the same base address, hence the same file names, and the segments get scheduled for background deletion. If one segment's files are pending deletion at the time the other segment's files are scheduled for deletion, the file rename attempted during the latter will fail on Windows (due to the typical Windows issues with open/memory-mapped files). It doesn't appear like we can simply close out the former files, as it seems that kafka intends to have them open for concurrent readers until the file deletion interval has fully passed. The fix in this commit basically sidesteps the issue by ensuring files scheduled for background delete are renamed uniquely (by injecting a UUID into the filename). Essentially this follows the approach taken with LogManager.asyncDelete and Log.DeleteDirSuffix. Collision related errors were also observed when running a custom stress test on Windows against a standalone kafka server. The test code caused extremely frequent compaction of the __consumer_offsets topic partitions, which resulted in collisions of the latter's log files when they were scheduled for deletion. Use of the UUID was successful in avoiding collision related issues in this context.
When a sanity check failure was detected by log recovery code, an attempt to delete index files that were memory-mapped would lead to a crash on Windows. This commit adjusts the code to unmap, delete, recreate, and remap index files such the recovery can continue. Issue was found via the LogTest.testCorruptLog test
Fix exceptions encountered when running automated tests on Windows, arising from the latter's issues with removal/renaming of opened/memory-mapped files. Includes changes adapted from: apache#154
Log manager code would fail to close an open lock file if it was unsuccessful in actually locking the file. On Windows this would lead to automated test failures as cleanup code attempted to remove the directory while the lock file remained open. This could be seen for example in: LogManagerTest.testTwoLogManagersUsingSameDirFails
Discovered when running on Windows virtual machines
JAAS config file written by test code contained a keytab location which caused numerous SASL tests to fail. Issue was the keytab location containing backslashes on Windows, which were escaped when writing out the the JAAS config. Switching over to Unix forward slashes resolved the issue.
Some suppressions were not being obeyed, resulting in checkstyle failures on Windows but not linux. Root issue is described here: http://rolf-engelhard.de/2012/11/using-checkstyles-suppression-filters-on-windows-and-linux/
Resolved conflicts
* Fix async deletion sequence - FileChannel for the topic was open, Windows will throw ADE * Adapted tests and clean-up
@GeorgeCGV |
@@ -59,6 +62,7 @@ class EmbeddedZookeeper() extends Logging { | |||
|
|||
Iterator.continually(isDown()).exists(identity) | |||
|
|||
txnLog.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just call zookeeper.getTxnLogFactory.close()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't matter much; it will be the same obj. ref anyway
* TODO: LogCleanerTest.testRecoveryAfterCrash fails
@jasonaliyetti |
They are separate issues but the end result is an unsuccessful build on Windows. |
Hi, I have created a new pull request. See #12331 |
This is continuous development of original pull request made by nxmbriggs404.
I am using Kafka primary on Windows platform; but KAFKA-1194 affects all Kafka versions on Windows and makes it unusable without restarts. There are several pull requests available to fix this problem but most of them are outdated. Nxmbriggs404 pull request is the most recent of them. Unfortunately, there is no further discussion and progress on forementioned pull request, so I created a new one.
Pull request contains the most recent merge of apache/kafka trunk branch into original pull request made by nxmbriggs404 with all related tests and code part adaptations.
The only major change is in the
LogManager.scala
file withinasyncDelete
. I had to change operations order forremovedLog
fromto
FileChannel
to avoidAccessDeniedException
on WindowsNew order allows
LogManagerTest.testFileReferencesAfterAsyncDelete
to succeed on Windows.Currently I am running long live test with all of the changes mentioned above. It seems to function as expected: log retention works. Setup is very simple - one broker.
TODO:
LogCleanerTest.testRecoveryAfterCrash
failsThere is potentially more broken parts on Windows... than the original bug
I think most Windows users would really like to see the old KAFKA-1194 issue being closed.