Testing is Science, Type Checking is Math

testing is science, and type checking is math

Testing software is like empirical science. You run experiments to see if there’s something wrong with a program.

Type-checking software is like math. You use cold, hard deductive logic to say if something is correct or not.

Science uses tests

Edsger W. Dijkstra famously said that “Program testing can be used to show the presence of bugs, but never to show their absence!” Testing can show that a program is incorrect, but it cannot show that a program is correct. You can find bugs by testing, but you can’t show that there are no bugs by testing. In any given piece of software there might be an infinite number of moving parts and factors. Who knows, there could be some cases that you haven’t covered.

Karl Popper argued that scientific theories can never be proven, they can only be disproven. You find ‘bugs’ in any theory by doing experiments, testing it, and showing it to be false. You can’t prove that a theory is true. You can work to make the theory a closer and closer approximation of truth, but you can never be sure that you have reached 100%, absolute truth. It’s a messy world out there. How can you guarantee that the theory has been tested under every possible condition, with every possible set of factors?

Edsger W. Dijkstra and Karl Popper pointing at each other in Friends' pointing meme

	Software Development	Empirical Science
uses	tests	experiments
to find	bugs	when a theory is wrong

In both empirical science and software development then, the experiments or tests to show what’s wrong with the theory or program. The more thorough your tests are, the more sure you can be that your program or theory is good. If Dijkstra and Popper are right, you can only approach correctness. You can get and infinitely closer approximation of the truth, kind of like approaching a limit in calculus.

curve approaching 100% correct as more testing is done

Science Uses Inductive Reasoning

Empirical science uses inductive reasoning. If we take a sample of 100 people and no-one lives beyond 110, we can make a generalization and say that “people don’t live past 110.” If we do the same thing with ten million people, we can say this with more certainty. If someone lives longer, we know we need to adjust our age limit. We see a certain thing happening over and over again, and so we can try to assume there is some law behind it. We can adjust our laws when we see other cases that prove them wrong. It’s not perfect, but it often works.

In the same way, software testing works to test hypotheses about how our software works. “If I type in this information, I will expect this response.” You have a theory of how a program is supposed to work. You can test this by tring a bunch of different inputs, trying to see if at any point this theory is wrong. But as Dijkstra said, your tests can only show you when that theory is wrong. They cannot show you that that theory is correct.

Failing (or falsification) is the basic useful action that tests can do. It’s the litmus test for a test being alive and working. Notice that in Test Driven Development the first step is to write a test that fails. First you write a test for something you know the program won’t do, and then you watch it fail. That shows you that the test is really working, and it’s really ‘getting at’ that problem. Then you adjust the program to be able to pass that test. At this point you know that you have changed something, and your theory about the program is a little bit stronger.

We can expand our test cases to make the testing more rigorous, and to be more sure that our program is correct. We can even use property-based testing, where we essentially set up programs to be an army of scientists running a whole bunch of tests on our programs using random data. But we are always approaching that point of 100% surety. Perhaps something else will break. Perhaps this or that browser will do something weird. End-to-end testing can cover more of these cases. But unless we have the most trivial example programs, can we ever cover every possible test case?

Testing software, like empirical science, can either be fairly straightforward or very messy. Science tries to find order in the chaos of the everyday world. It tries to test what it can, and tries to make better and better theories of how things work. Sometimes, like with classical mechanics, we have an extremely good understanding of what is going on… until we get to that weird edge case of quantum mechanics. Shoot! Now we are faced with a lot of mystery. We have to try this or that and see what works. Sometimes we will have a very good handle on what a program is doing. We basically know exactly what it will do. Other times, we have almost no idea how some program is working.

I remember working on one particularly terrible mess of code I created. As I kept building this monstrosity I had less and less of a clue as to what was going on and why. Towards the end I was just changing random stuff and holding my breath, hoping to see the green checkmarks. Finally I was able to assemble comprehensive battery of 300 or so tests, and I breathed a great sigh of relief as I saw them all light up green. But then, months later, I still kept finding bugs.

Math Uses Logic and Deductive Reasoning

We see that empirical science is often messy and uncertain. But pure math, in contrast, is clear and pristine. Pure math lives is a world of pure logic and axioms. Given a set of axioms, you can use hard, deductive logic to see determine if statements are true or not. Math has the privilege of operating in a world of absolute truth.

In A Mathematician’s Apology G.H. Hardy contrasts the work of physicists with the work of pure mathematicians. He boldly claims that “the mathematician is in much more direct contact with reality” than the physicist. He explains that physicists try to line up the fuzzy mess of reality they’re faced with with some clear rules or relations. “A mathematician, on the other hand, is working with his own mathematical reality.” (section 24) Scientists have to work with a approximations of an ideal. The have to try to make that fit with the messy behavior of things in the real world. Mathematicians work with an ideal world of truth and logic.

This mathematical world of pure logic and deduction also corresponds perfectly to the type checking we use in software development. See the talk Propositions as Types by Philip Wadler. Just like a logician can put together axioms and be assured of the truth or falsity of different algorithms, we can use type checking to see if the pieces of our program fit together correctly or not.

slide from Philip Wadler's talk on Propositions as Types

Incredibly, the perfect world of deductive reasoning also exactly mirrors the systems we use for type checking the programs we write. And herein lies the magic: we can steal from the mathematicians ideal world of pure logic and build real-world programs using these incredible axioms. It’s a wonderful experience to be typing and have the type checker, in real time, show you when something is true or false. I can change one thing in a massive TypeScript project and instantly see the red lines appear where things break in other files. I can know that a program is correct not just because I’ve tested a whole bunch of different cases, but because I know that it does fit together logically.

Tests only work to test something that’s been built. They can make sure an existing piece of software doesn’t have (certain) errors. As soon as you change the building, you have to change the tests.

But type checking does something much more incredible. It allows us to build out endlessly, and ensures that whatever we build our structure stays intact and safe. We can re-form something or add a big new addition, and if we stay within the confines of logic, we know that everything still all good. This allows for tremendous power in building and exploring things. The benefits are undeniable. It’s no wonder that typed languages like TypeScript and Rust are gaining popularity.

Of course, this is only true if the type system is sound and complete. The popular type systems mentioned above do NOT cover everything, and they are NOT fool-proof. Althought these type-systems will cover a lot of things in your code, you can still make errors and break things without being warned by the type system. But people are working on better, more complete systems, using dependant and linear types. With these sound, complete systems, people can be sure that the entire program is logically, 100% correct.

With these advanced type systems, you can do so much more than just make sure a certain variable is a String or an Int. You can make sure that the entire algorithm is 100%, logically sound. Uncle Bob was wrong to say that “The correctness of your type model has no bearing on the correctness of the behavior you have specified.” The more rigorous the type model, the more it will ensure correct behavior. And with more advanced type systems and things like dependant types, you can type the behavior for a lot of things as well.

It’s wonderful to see how mainstream programming languages are benefitting more and more from type checking. It’s great to see how we don’t always need to rely on empirical testing, but can also build things out, knowing they are correct by automatic deduction and logic. But it is not always easy or feasible to have 100% sound, type-checked programs. Usually a lot of stuff we build isn’t perfectly typed and proven. And so we still need to run a lot of tests to try to find the mistakes in what we’re building.

Testing does not replace type-checking

Does science replace math? Would scientists ever say that experiments can replace the need for math and formulas? Of course not. In the same way testing doesn’t replace type-checking. Any good scientist would reach for as much math as he can to try to make sense of what is going on. In the same way, any good programmer will use the logic of type theory to see if their program lines up. Some might do that informally in their head, using a dynamic language. Others rely on the automatic rigor of a type-checker.

Use both math and science / types and testing

When trying to solve problems in the world, people will use as much math as possible to try to make sure that things line up. But there are lots of places where we’re not sure how to apply the math, or we don’t know exactly what’s going on. And then people need scientific experiments. Then, just to make sure that their math was correct, people probably need to do some more experiments.

In the same way, as far as possible, use types to reason about how a program works and guarantee that it is correct. And where you can’t get types to reach that far, use testing to try to get closer to a bug-free program. And just to make sure, probably use more tests to cover as much as you can.