Notes to self

Whatever I feel like writing

JDBC Connection Pools, does the world need another?

Over the past year or so, I have on occasion been working on a project called NanoPool. It is a JDBC 2 connection pool that exposes itself as a DataSource and is released under the Apache 2 license.

I started writing it after having read Java Concurrency in Practice, because I wanted to experiment with implementing something that was simple, concurrent and lock-free.

This experiment ended up being a JDBC connection pool implementation that is really fast at connection reuse. It is lock-free so internal deadlocks are impossible. It supports hot resizing so you can adjust the pool size during peak hours. It has hooks for high contention callbacks so you can be notified if your pool is too small. It has no internal threads so life-cycle management is simpler (and no thread leaks). It has a full-featured JMX interface so management is theoretically a breeze.
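
To make the lock-free claim and release idea a bit more concrete, here is a minimal sketch. It is not NanoPool's actual code: the names `SlotPool`, `claim` and `release` are made up for illustration, and it pools arbitrary objects rather than JDBC connections. Each slot is a cell in an `AtomicReferenceArray` that is taken and returned with compare-and-set, so no thread ever holds a lock.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Supplier;

// Hypothetical illustration only; this is not NanoPool's implementation.
final class SlotPool<T> {
    private final AtomicReferenceArray<T> slots;

    SlotPool(int size, Supplier<T> factory) {
        slots = new AtomicReferenceArray<>(size);
        // Saturate the pool up front: every slot holds a ready resource.
        for (int i = 0; i < size; i++) {
            slots.set(i, factory.get());
        }
    }

    // Try to take a resource without blocking; returns null if every slot is empty.
    T claim() {
        for (int i = 0; i < slots.length(); i++) {
            T candidate = slots.get(i);
            // The compare-and-set succeeds for exactly one thread per resource.
            if (candidate != null && slots.compareAndSet(i, candidate, null)) {
                return candidate;
            }
        }
        return null; // a real pool might invoke a contention callback here
    }

    // Put a resource back into the first empty slot.
    boolean release(T resource) {
        for (int i = 0; i < slots.length(); i++) {
            if (slots.compareAndSet(i, null, resource)) {
                return true;
            }
        }
        return false; // pool already full; the caller should discard the resource
    }
}

class SlotPoolDemo {
    public static void main(String[] args) {
        int[] counter = {0};
        SlotPool<String> pool = new SlotPool<>(2, () -> "conn-" + counter[0]++);
        String a = pool.claim();
        String b = pool.claim();
        System.out.println(a + " " + b + " " + pool.claim()); // conn-0 conn-1 null
        pool.release(a);
        System.out.println(pool.claim()); // conn-0
    }
}
```

A real connection pool would also have to validate, age out and reopen connections, and decide what a caller does when `claim` comes back empty. The sketch leaves all of that out.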

With all these nice things come a number of caveats. For instance, NanoPool will, per its design, saturate the pool and try to keep all connections open at all times. Most other pools have a maximum and minimum size that bounds how many connections will be kept open at any given time. NanoPool does not do this; it has just a pool size.

Another caveat comes by virtue of not having any internal threads. This means that the only threads that can do pool maintenance work, such as reestablishing aging connections, are the client threads that call into the pool to either claim or release connections. This may be a problem if you worry a lot about latency and standard deviation in response times. That said, I have not really checked if other pools do any better in this respect, so test & check it yourself if this is a concern.

Anyway, the question that is really on my mind is this: NanoPool is almost at 1.0 state, but does it make sense to complete it and ship it? I mean, does it really add something of value and not just noise? I think that people are extremely conservative when picking connection pools so it may be a wasted effort. On the other hand, if I go through and release it then I take on the responsibility of support and maintenance nearly regardless of the user base (although I think there will be less of this work for NanoPool than what I ended up with for Fabric).

In other words: does the world need another connection pool?

Code Need

The following is something I intend to sleep on:

Introduce abstractions out of need. Abstractions that are introduced through design rather than need are inherently over-designed.

Programming is about satisfying need.

Agility is about reacting to need rather than anticipating it.

TDD is a process of formulating need at a low level.

Why git-svn is not a real solution

For the past two months or so, my primary Subversion client has been git-svn. Obviously, I like Git. It is my favorite SCM tool, and cheap local branches, coupled with an excellent ability to merge branches, have a lot to do with why that is.

So why git-svn? Well, we have historically been standardizing on CVS and Subversion at my work place, and all of our code is versioned by one of these systems. I could mention many things that I dislike about these two SCMs, but their Eclipse integration is pretty good, and that has been politically decided to be a big plus. Therefore, all code I produce at work must be put in either CVS or Subversion, and Git is not under consideration until it can be shown that the EGit Eclipse plug-in has accrued an acceptable amount of awesome. This is why I decided to put my faith in git-svn two months back.

As you might have guessed from the title, things aren't all rosy. It's not that git-svn doesn't work, it's the way it works that is bothering me. Git allows you to manipulate the history of your repository, and when you do a merge in Git, the commits from each branch will be chronologically interspersed in between each other. This model does not map very well to Subversion. If I merge a branch into trunk in Subversion, I am in essence making one big fat commit to trunk that goes on top of all of the existing commits, and contains all of the changes from the branch. There is no history rewriting because the history is lost by flattening all of these commits into one.

Therefore, to shoe-horn itself into this model, git-svn will use rebasing instead of merging. When you rebase a branch A onto another branch B in Git, you are taking all commits from A that are not already in B, and reapplying them, in order, to the HEAD of B. This makes it look like these commits were written and committed after all other commits in B.

This is a clever trick but it comes at a pretty high cost. A commit that is retrofitted to have a different past is no longer the same commit. So what happens when you make some changes in B that you want back in A? With ordinary merging in Git, this problem is trivial - you simply merge B back into A. However, because of the rebasing, A and B now contain two sets of commits that are essentially the same changes, but with different pasts and therefore different commits. This means that merging breaks down. If it's just one or two commits we're talking about, then we can cherry-pick. But if we are talking about a larger set of changes, then we quickly find ourselves reaching for the rebase-katana.
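
Why a rebased commit is "no longer the same commit" can be shown with a toy model. The sketch below is not Git's real object format. It only assumes what Git does in principle: a commit's id is a hash that covers its parent as well as its content. The class names `ToyCommit` and `ToyRebase` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Toy model, not Git's real object format: a commit id is a hash of the
// parent id plus a description of the change.
final class ToyCommit {
    final String parentId; // null for a root commit
    final String change;
    final String id;

    ToyCommit(ToyCommit parent, String change) {
        this.parentId = parent == null ? null : parent.id;
        this.change = change;
        this.id = sha1(parentId + "\n" + change);
    }

    private static String sha1(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 ships with standard JVMs
        }
    }
}

final class ToyRebase {
    // Reapply the given commits, in order, on top of newBase.
    static List<ToyCommit> rebase(List<ToyCommit> commits, ToyCommit newBase) {
        List<ToyCommit> result = new ArrayList<>();
        ToyCommit tip = newBase;
        for (ToyCommit c : commits) {
            tip = new ToyCommit(tip, c.change); // same change, new parent, new id
            result.add(tip);
        }
        return result;
    }

    public static void main(String[] args) {
        ToyCommit root = new ToyCommit(null, "initial import");
        ToyCommit onA = new ToyCommit(root, "fix typo");    // commit on branch A
        ToyCommit onB = new ToyCommit(root, "add feature"); // commit on branch B
        ToyCommit rebased = rebase(List.of(onA), onB).get(0);
        System.out.println(onA.change.equals(rebased.change)); // true: same change
        System.out.println(onA.id.equals(rebased.id));         // false: different commit
    }
}
```

In this model, any tool that compares branches by commit id sees the original and the rebased commit as unrelated, even though they carry the same change.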

Rebasing is a powerful tool that allows us to rewrite history. However, if we use it too much, we can quickly rewrite ourselves into a hairy mess of unmergeable branches. Consider this: just as we can get merge failures when two commits are incompatible, we can likewise get rebase failures. But because rebasing is reapplying all distinct commits in order, a rebase failure will affect all subsequent commits waiting in line of the rebasing process. A merge failure is resolved once for a merge, but if you are rebasing 10 commits onto a branch and the third commit fails, then that failure could potentially propagate to the seven other commits, if they depend on the changes in a failing earlier commit.

What are the consequences of this? Because git-svn does a rebase on every git-svn dcommit and git-svn rebase, thus (more or less) rewriting history, I am pretty much left with a Git repository where cheap local branches and merging are somewhere between troublesome and non-functioning.

That's a pretty steep cost, in my humble opinion. But I still use it. Yes, despite these dragons and drawbacks, I still use git-svn as my primary Subversion client. I still like my ability to freely mold and modify unpushed history, I still like my git-bisect and I still like colored console output.

I have experienced these pitfalls on my own repositories and I have learned to avoid them. But if you think that you can reap the benefits of Git while keeping ye olden Subversion around, then you're wrong. With git-svn, you are voluntarily pulling an SVN-branded straitjacket over your Git repository. To truly get the most out of Git, you absolutely have to sync with a real remote Git repository.

Web Acronyms Test

I took the web acronyms test and the result is below (first try):



So... apparently I'm good at remembering acronyms?

Breaking the Law (of TDD)

The Three Laws of TDD is an excellent and productive method of raising the quality-bar of your code.

This "proper" form of TDD ensures a number of things. First, your code is testable. If it is not, your tests will become too hard to write, but once you have good coverage, you can then safely refactor your code such that it becomes testable. This is a good thing because testability is a property of design that implies modularity, composability and flexibility.

Second, the closed think-test-code-refactor loop makes your programming intentional and deliberate, as opposed to programming by coincidence. It forces you to think about your code and your design before you get typing. I believe that these extra brain cycles aren't just wasted on trying to figure out how to write a test of non-existing code, but also partially translate into code that is better thought through and, ultimately, better written.

Lastly, the most direct effect of this process is the test coverage gained. If followed strictly, then you are in theory guaranteed that all lines in your code are covered by tests. In reality, however, the dreaded 100% coverage is not only overrated, but can lull you into a false sense of security. But proper test coverage enables refactoring and thus makes it possible to continuously improve the code base.

All nice and well. I try really hard to adhere to the laws of TDD, but there are things that you just can't test. And not just the usability, but code-things too. This is especially true for highly concurrent code where failures are probabilistic and sometimes even impossible under certain conditions (hardware, VMs or compiler flags). I have sort of accepted this property of concurrent code, because I find that I can often encapsulate the concurrent parts and make them simple enough to be verified analytically.

However, the other day, I ran into a problem. I needed to write a function. It did not modify any state, was idempotent, did not involve any concurrency (the whole program was single-threaded) and it did not even have anything to do with usability. But I could not write a test for it.

The purpose of this function was to return the absolute path to the current user's home directory. That was all. How do I test that? I could hard-code the path to my own home directory in the test, but then it would fail for everyone else. I could try the function manually and check the result myself, but I don't have a Windows machine and this program was supposed to be platform independent. Still, that was what I ended up doing. Then I just copy-pasted some code from somewhere to fill in the Windows special case and hoped for the best.
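
One partial way out is to test properties of the result instead of its exact value. The sketch below assumes Java, which may not be the language of the program in question, and uses the standard `user.home` system property. The `HomeDir` class and `homeDirectory` method are hypothetical names. It cannot prove that the path is the right one, only that it looks like a home directory on whatever machine runs the test.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class HomeDir {
    // Resolve the current user's home directory from the standard
    // "user.home" system property and make the path absolute.
    static Path homeDirectory() {
        return Paths.get(System.getProperty("user.home")).toAbsolutePath();
    }

    public static void main(String[] args) {
        Path home = homeDirectory();
        // Machine-independent checks: the exact path is unknown to the test,
        // but these properties should hold on any sane setup.
        System.out.println("absolute:  " + home.isAbsolute());
        System.out.println("directory: " + Files.isDirectory(home));
    }
}
```

This still leaves the platform-specific branch unverified on platforms you cannot run, so it softens the problem rather than solving it.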

It felt like a dirty thing to do, because this program had been written with TDD from the start, and I had just written the first function without any unit tests. Indeed, parts of it wouldn't even run on my machine because it was Windows-specific.

I won't make it a habit, though. I need to keep telling myself that.

64bit HotSpot hates Eclipse

If you find that Eclipse is incredibly crashy on the 64bit Linux HotSpot JVM, then try adding these two lines to your eclipse.ini file:

-XX:CompileCommand=exclude,org/eclipse/core/internal/dtree/DataTreeNode,forwardDeltaWith
-XX:CompileCommand=exclude,org/eclipse/jdt/internal/compiler/lookup/ParameterizedMethodBinding,<init>

I concluded this by scoping out the hs_err_pid* files that HotSpot produces when it crashes, and in these files you can spy something like the following:

Current CompileTask:
C2:484      org.eclipse.core.internal.dtree.DataTreeNode.forwardDeltaWith([Lorg/eclipse/core/internal/dtree/AbstractDataTreeNode;[Lorg/eclipse/core/internal/dtree/AbstractDataTreeNode;Lorg/eclipse/core/internal/dtree/IComparator;)[Lorg/eclipse/core/internal/dtree/AbstractDataTreeNode; (469 bytes)

It basically says that the JIT crashes trying to compile that method, and the solution is to prevent it from being JITted.

Keeping a tidy history with Git.

Git is a power tool that allows you to modify your history and be selective with the contents each time you commit. I believe this power should be used to keep as tidy a history as practically possible.

Mind you that "modifying history" does not include history that has been pushed and is visible to others. While it is possible to do, it is also considered a cardinal sin among Gitters, because you are ruining the repositories of people who have pulled the history that you have since modified.

There are two qualities that I try to achieve when I prepare a commit: cohesion and consistency. Explanation follows.

Consistency is about not committing a broken build. This is important in any SCM, but with Git you might argue that only the latest commit in any push needs to build - I beg to differ.

The reason that every commit must at least build is for the sake of any future git-bisect. I was debugging a nasty memory leak in a Java application once - a server was leaking interned Strings. The project was kept in CVS but for the purpose of debugging (among other things) I had exported the repository to Git. This allowed me to hunt for the commit that had introduced the memory leak with git-bisect, which I thought was pretty clever. That is, until I happened upon a commit that didn't build. If I could not build the software, then I could not test it for the presence of the bug that I was hunting. And making the software build means changing it, which introduces the risk of getting a skewed result from the test - what if my changes removed or reintroduced the bug? Or some other bug that would mask the real bug that I was hunting? I don't recall exactly how I handled the situation but I certainly wasn't happy about it.
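
The reason a single unbuildable commit hurts so much is easier to see when bisecting is written out as the binary search it is. The following is a simplified sketch, not how git-bisect is implemented: it assumes a linear history where the first commit is good, the last is bad, and everything after the culprit is bad. `BisectSketch` and `firstBad` are hypothetical names.

```java
import java.util.List;
import java.util.function.Predicate;

final class BisectSketch {
    // Returns the index of the first bad commit, assuming history.get(0) is good,
    // the last commit is bad, and every commit after the first bad one is bad too.
    static <C> int firstBad(List<C> history, Predicate<C> isBad) {
        int good = 0;                 // newest commit known to be good
        int bad = history.size() - 1; // oldest commit known to be bad
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            // This is the step that needs a commit that builds and runs.
            if (isBad.test(history.get(mid))) {
                bad = mid;
            } else {
                good = mid;
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> history = List.of("c0", "c1", "c2", "c3-leak", "c4", "c5", "c6", "c7");
        int[] builds = {0};
        int culprit = firstBad(history, c -> {
            builds[0]++; // each test stands for one checkout, build and run
            return history.indexOf(c) >= 3;
        });
        // prints "c3-leak found after 3 builds"
        System.out.println(history.get(culprit) + " found after " + builds[0] + " builds");
    }
}
```

The search can only halve the range if the question "is this commit bad?" has an answer at the midpoint. Git does offer `git bisect skip` for commits that cannot be tested, but every skipped commit leaves a wider range of suspects at the end.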

So consistency is important, and foul-ups in this regard are the reason (or at least one of the reasons) why "git commit --amend" exists.

Cohesion is about not mixing unrelated changes in the same commit. Think of the git-bisect use-case above; now that you have finally found the bad commit, it turns out that it implements one new feature, two refactorings and four files have their indentation corrected - good luck finding that bug.

But that's not even my primary concern with cohesion. My primary concern is actually code reviews: if you had to review the hypothetical commit mentioned above, then you'd have your work cut out, and it wouldn't be much fun. Instead, if things were properly split up in 4 to 7 commits (depending on how you cut the indentation changes) then the review would go much easier: you'd know that you can gloss over the indentation changes pretty quickly, verify the refactorings with good speed and save your best brain-cycles for the new feature.

I find the following list of "themes" to be pretty good natural boundaries for commits:

  • Styling, indentation, spelling and grammar.
  • A new feature.
  • A bug fix.
  • Renaming, moving and adding files, or moving existing code chunks into their own files.
  • Refactoring or clean-up of cohesive code chunks.


If you happen to mix these changes in the same file, then you can use the interactive adding feature of Git ("git add -i") to split up the chunks of your changes and put them in different commits. For instance, I often correct indentation, style, grammar and spelling on sight and often while doing something else with the file. Then I break up the changes afterwards with interactive adding.

So, these are the qualities I reach for when I prepare to commit. They are just guidelines, though. Common sense is important: if I am certain that I will get a better history by breaking some of these rules, then I will do so. And I sometimes do, though it is rare.

Best programmers 28 times better than the worst? I'm not sure.

In his book Facts and Fallacies of Software Engineering, Robert L. Glass cites from a research paper from 1968, that the best programmers can be up to 28 times better than the worst.

That paper is Sackman, H., W. I. Erikson, and E. E. Grant. "Exploratory Experimental Studies Comparing Online and Offline Programming Performance." A brief, but publicly available, treatment of it can be found here and most likely elsewhere too if you ask Google.

Now, I know it's bad form to criticize a paper I have not read, but you need an ACM account to get at the real thing, so I will instead base my opinion on the linked blog post, and otherwise be brief about it.

I have two points of critique, and the first and most obvious one is regarding the age of the paper. It was published 41 years ago and its primary purpose was to figure out whether time-sharing or batch processing systems were the most productive - which it did. And then, "almost in passing," apparently, they present numbers showing that the best programmers in their study were up to 28 times better than the worst.

I have to wonder whether the numbers have changed. Have the worst programmers gotten better, or worse? And are we even able to distinguish such a number from a difference among the best programmers? I have absolutely no idea. Most people I know, myself included, weren't even born 41 years ago.

The second peeve I have with the unknown content of that paper, or perhaps rather the 28:1 quote in the context of that paper, is the dataset. While I don't know how many programmers participated, I do know that they were tasked with solving two programming problems.

While such a method is able to show peaks such as "up to 28 times better," it fails to show whether this grade-28 master is able to keep ahead in the long run. The two programming problems they were presented with were pretty small in size, especially compared to many of the projects that are going on in the real world right now.

Is the master able to keep his times-28 advantage throughout a project that takes a year to complete - or at least is "normally" estimated to a year's worth of work? Assuming they will finish at some point, how long will that same project take the worst programmer to complete?

It is possible that operating under the assumption that the worst programmer will, eventually, finish the project is a pretty far stretch. But so does the "28 times better" seem when considered in the long run. Given the numbers are based on a mere two programming problems, the 28 times may have been a fluke. Maybe the guy was just lucky - you can argue that it takes skill to be that lucky, but still.

I once spent one and a half months hunting a nasty memory leak (and I don't ever want to do that again), but once I found it I just had to change two lines and it was fixed. I could have been lucky and found those lines after a week, or never introduced the leak in the first place, but unfortunately it didn't happen like that.

In conclusion, I don't think that paper has the material to justify "up to 28 times better" as fact. The difference is there, for sure. And many people seem to feel in their gut that it may be around 10 times better. That's also a nice round number, but I won't claim it's a fact though.

My Estimates Have Gotten Worse

I'm now into the third chapter of Clean Code, and I am already starting to use what I learn in practice.

The book is showing me a new level to reach; it's pretty high but looks doable. Using the words of the software craftsmen, it is raising the bar.

But, as with all newly gained knowledge, its use is not automatic. I am expending a noticeably greater amount of brain cycles to keep the quality of my code up at this new level.

This is, noticeably, affecting my productivity, understood as the rate at which I get stuff done. I don't have any hard numbers on this because of my failure to formalize my estimates (I know, I know... many bad excuses go here).

So, while I'd like to think that the quality of the code I'm producing is higher, I am also taking longer to produce it. This means that previous experiences about how long things take have grown increasingly inaccurate. And due to lack of retrospectives and formalism in my estimation process, this growth has been without control.

I am not saying that I'm now suddenly catastrophically bad at estimating how long things take, just that the growth has been uncontrolled. I will assert that as this new knowledge moves from the frontal lobe to the spine, my estimates will steadily return to normal (unless I start to actively improve them, in which case they will become better).

This movement of knowledge is also known by the name "experience," and the only way to attain it is through practice.

Considering this, all I can do is to keep at it and improve, but remember to keep a close eye on the schedule, and try to add in the extra overhead whenever I estimate.

Reading spree

I read Java Concurrency in Practice (Brian Goetz) and it taught me something important. It showed me a whole new "dimension" (I can't explain it any further than that - words fail me) of reasoning about code. This was not the intent of the book, but rather a side-effect of the examples in it and its approach and focus on correctness.

I now write all my code with this same focus and attention to correctness, and I always reason about my code through this new dimension that I have discovered, and I am always conscious about the thread-safety of my code.

Then I read Release It! (Michael T. Nygard) and it showed me that my newfound attention to correctness could, and should, be extrapolated to a much higher inter-system scale. The approach is now less about fundamental reasoning, and instead more about making conscious design decisions.

The situations where I need to apply this knowledge are fewer and farther between, but their effects are equally profound and, I dare say, even more noticeable and visible to the people who surround the systems I work on, and my fellow programmers.

Then I read Java Management Extensions (J. Steven Perry) and it sucked. Moving on.

Then I read Facts and Fallacies of Software Engineering (Robert L. Glass) and while my code is unaffected, I still learned something important. Actually, I learned many things from this book. But the book itself is like a window through which, if you care to look, you can sense something called experience.

Here's a guy who at the time of writing (2002) had been a software practitioner for 45 years. He has seen things, tried things, evaluated things, researched things and written a ton of code. He does not use the word "fact" because he is pompous. When he says "fact," he implies anecdotal evidence and research papers (that he has actually read and understood).

Where was I going with this? Oh, yes. The facts and fallacies are worth remembering. And that's about it. This stuff is good to know, because, it just is. Okay?

Now I'm reading Clean Code (Robert C. Martin) and I have only just made it past the introduction, but I already sense something. I get this weird feeling. You'd think that any reasonably experienced programmer knows what clean code is, knows it when he sees it, knows how to write it and knows how to change bad code into clean code.

Do you know what clean code is? Can you tell me? If you think "yes" to these questions, then try to imagine yourself in the classical elevator-pitch situation and tell me, right to my face, what clean code is. Do you still know what clean code is? Can you even point to an example that you are certain is clean code?

I have just made it through the introduction and just realized that I can't do any of those things. And what's more: clean code is more important than I thought before I started on this book.

I can see my horizon expanding but I have no idea how far it will go - that's the feeling I have about that book, right now.

I wonder what I'll read next :smile: