Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-2170: Updated Fixes For Windows Platform #4431

Closed
wants to merge 13 commits into from

Conversation

GeorgeCGV
Copy link

@GeorgeCGV GeorgeCGV commented Jan 17, 2018

This is continuous development of original pull request made by nxmbriggs404.

I am using Kafka primary on Windows platform; but KAFKA-1194 affects all Kafka versions on Windows and makes it unusable without restarts. There are several pull requests available to fix this problem but most of them are outdated. Nxmbriggs404 pull request is the most recent of them. Unfortunately, there is no further discussion and progress on forementioned pull request, so I created a new one.

Pull request contains the most recent merge of apache/kafka trunk branch into original pull request made by nxmbriggs404 with all related tests and code part adaptations.

The only major change is in the LogManager.scala file within asyncDelete. I had to change operations order for removedLog from

  1. Rename directory (atomic move)
  2. Make checkpoint log for offsets and recovery
  3. Mark for deletion

to

  1. Make checkpoint log for offsets and recovery
  2. Close FileChannel to avoid AccessDeniedException on Windows
  3. Rename directory (atomic move)
  4. Mark for deletion

New order allows LogManagerTest.testFileReferencesAfterAsyncDelete to succeed on Windows.

Currently I am running long live test with all of the changes mentioned above. It seems to function as expected: log retention works. Setup is very simple - one broker.

TODO:

  • LogCleanerTest.testRecoveryAfterCrash fails

There is potentially more broken parts on Windows... than the original bug

I think most Windows users would really like to see the old KAFKA-1194 issue being closed.

nxmbriggs404 and others added 12 commits June 9, 2017 10:53
Special treatment is needed for the deletion and renaming of log and
index files on Windows, due to the latter's general inability to
perform those operations while a file is opened or memory mapped.

The changes in this commit are essentially adapted from:

apache#154

More detailed background information on the issues can also be found
via that link.
This commit addresses an edge case with compaction and asynchronous
deletion of log files initially encountered when debugging:

LogCleanerTest.testRecoveryAfterCrash

failures on Windows.  It appears that troubles arise when compaction
of logs results in two segments having the same base address, hence
the same file names, and the segments get scheduled for background
deletion.  If one segment's files are pending deletion at the time the
other segment's files are scheduled for deletion, the file rename
attempted during the latter will fail on Windows (due to the typical
Windows issues with open/memory-mapped files).  It doesn't appear like
we can simply close out the former files, as it seems that kafka
intends to have them open for concurrent readers until the file
deletion interval has fully passed.

The fix in this commit basically sidesteps the issue by ensuring files
scheduled for background delete are renamed uniquely (by injecting a
UUID into the filename).  Essentially this follows the approach taken
with LogManager.asyncDelete and Log.DeleteDirSuffix.

Collision related errors were also observed when running a custom
stress test on Windows against a standalone kafka server.  The test
code caused extremely frequent compaction of the __consumer_offsets
topic partitions, which resulted in collisions of the latter's log
files when they were scheduled for deletion.  Use of the UUID was
successful in avoiding collision related issues in this context.
When a sanity check failure was detected by log recovery code, an
attempt to delete index files that were memory-mapped would lead to
a crash on Windows.  This commit adjusts the code to unmap, delete,
recreate, and remap index files such the recovery can continue.

Issue was found via the LogTest.testCorruptLog test
Fix exceptions encountered when running automated tests on Windows,
arising from the latter's issues with removal/renaming of
opened/memory-mapped files.

Includes changes adapted from:

apache#154
Log manager code would fail to close an open lock file if it was
unsuccessful in actually locking the file.  On Windows this would lead
to automated test failures as cleanup code attempted to remove the
directory while the lock file remained open.  This could be seen
for example in:

LogManagerTest.testTwoLogManagersUsingSameDirFails
Discovered when running on Windows virtual machines
JAAS config file written by test code contained a keytab location
which caused numerous SASL tests to fail.  Issue was the keytab
location containing backslashes on Windows, which were escaped when
writing out the the JAAS config.  Switching over to Unix forward
slashes resolved the issue.
Some suppressions were not being obeyed, resulting in checkstyle
failures on Windows but not linux.  Root issue is described here:

http://rolf-engelhard.de/2012/11/using-checkstyles-suppression-filters-on-windows-and-linux/
* Fix async deletion sequence
   - FileChannel for the topic was open, Windows will throw ADE
* Adapted tests and clean-up
@jasonaliyetti
Copy link

@GeorgeCGV
I'm looking at the same issue right now and I'm hitting an issue with the tests around transactions on Windows. There seems to be some sort of bug in TransactionMarkerChannelManager that's causing this to hang in TransactionsTest.testBasicTransactions. Have you run into this as well?

@@ -59,6 +62,7 @@ class EmbeddedZookeeper() extends Logging {

Iterator.continually(isDown()).exists(identity)

txnLog.close()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just call zookeeper.getTxnLogFactory.close()

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter much; it will be the same obj. ref anyway

* TODO: LogCleanerTest.testRecoveryAfterCrash fails
@GeorgeCGV
Copy link
Author

@jasonaliyetti
Indeed, transaction tests seems to be broken.
Based on your previous comments in PULL-3283 it looks like they are not related to the KAFKA-1194 log rotation problem on Windows platform.

@jasonaliyetti
Copy link

jasonaliyetti commented Jan 26, 2018

They are separate issues but the end result is an unsuccessful build on Windows.

@simplesteph simplesteph mentioned this pull request Sep 3, 2018
3 tasks
@GeorgeCGV GeorgeCGV closed this Jan 28, 2019
@MPeli
Copy link

MPeli commented Aug 22, 2022

Hi, I have created a new pull request. See #12331

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants