3-22-04 How to Argue about Typing
Many people observe that type checking is a religious discussion
best avoided. I often agree, having started more than my share of fires in this
area. However, there are a few issues surrounding types and type checking that
capture the essential distinctions between programming languages. This
understanding makes the pitfalls worth the risk, so in this article (from which
I will draw my
Closing Keynote Address at the Python Conference),
I will look at various issues and arguments surrounding the concept of type, and
in particular examine the phenomenon of "latent typing" (also called "weak
typing"), why the concept is powerful, and how it is expressed in different
languages.
Background: When I was in junior high school in Southern California,
they tried to teach me Spanish by showing us films about "E Man" (a dog, I
think), and albondegas (meatballs). That's about all I remember, because I
reacted strongly when the teacher would speak Spanish to us and then expect us
to reply with something meaningful. I was not conscious enough to see the
motivation for learning this, and the effort seemed painful. The same was true
for playing the guitar (it makes your fingers hurt!) and any number of other
potential abilities that never manifested.
So I am sympathetic with those who resist learning more than one language.
I'll probably only ever know a few phrases of non-English languages, and it is
only through experience that I've discovered the incredible value of learning
more than one programming language. The biggest payoff is this: I am able to
think about things in Java or C++ that I was unable to conceive of before I
learned Python. And from what I've read and the code that I've seen, very few
single-language users have been able to conceive of these things, either. Your
language really does constrain your thoughts.
Several years ago, I noticed that a number of programming language features
revolve around particular philosophies of type checking. These philosophies had
as their foundation the idea that a particular way of doing things will always
yield better results. In addition, the meaning of various terms tends to be
given in terms of the language that one knows and likes best.
I stepped into the middle of this when I began using the term "weak typing"
(which I have since observed is quite different from "weakly typed"). At the
time, the term seemed to come from an authoritative source, but now I'm unable
to track it down and discover its genesis. In Java, I also produced a hailstorm
of protest when I asserted (see "Thinking in Java, 3rd edition") that checked
exceptions, where the compiler forces you to write code to handle the exceptions
from particular calls, rather than allowing you to decide whether or not to
handle the exception at that time, were an experiment in that language that
largely caused more trouble than they were worth (not to say they aren't useful
sometimes, but they get in the way more than they help). Both of these
discussions seem to touch deep nerves in fundamental belief systems.
The value of what I've been calling "weak typing," but which seems like it
may be more predominantly called "latent typing," is fascinating, whatever we
choose to call it. Basically, it's how you say "I don't care what this type is"
in certain programming constructs ("...but I still care that the type behaves
correctly..."). In C++, this construct is the template. In Java, and C# they use
the term "generic" which implies the same thing, but in practice it can only be
as generic as the root class Object. But in Python (and Smalltalk), it's
just the way you define any function:
def speak(anything):
anything.talk()
Now, I can say:
class Dog:
def talk(self): print "Arf!"
def reproduce(self): pass
class Robot:
def talk(self): print "Click!"
def oilChange(self): pass
a = Dog()
b = Robot()
speak(a)
speak(b)
speak() doesn't care about the type of its argument, so I can pass
anything to it, so long as the object I pass supports the talk() method.
This example easily translates to C++:
#include <iostream>
using namespace std;
class Dog {
public:
void talk() { cout << "Arf!" << endl; }
void reproduce() {}
};
class Robot {
public:
void talk() { cout << "Click!" << endl; }
void oilChange() {}
};
template void speak(T speaker) {
speaker.talk();
}
int main() {
Dog d;
Robot r;
speak(d);
speak(r);
}
What is this? It's almost as if you're inventing a new type: "the type
acceptable to speak()," because it certainly doesn't refer any other type
that exists in the system. You might also argue that speak() defines two
basic categories of types: those acceptable to speak() and everything
else. Neither of these definitions work for me. It appears that there is either
no type specified for x, or that the type constraints on x have
been weakened to allow many different types to be used.
After dealing with terminology, This article looks at the way weak/latent
typing is achieved in Python and C++, and what generics look like in Java and C#
(these don't really support latent types).
Weakly typed vs. Weak typing
C++-Proxy: "C++ has strong, static type checking."
Smalltalk, Python, etc. Wonk: "C/C++ is weakly typed because it has
casts and unions."
Proxy: "That's a bit extreme, isn't it? How about if we say that C++
is strongly, statically typed with a hole in the type system: casts (and who
really uses unions, anyway?)."
Wonk: "Nope. If a language allows an object to accept any incorrect
message, then it's weakly typed. Java also allows casts, but it checks them,
both at compile time to see if the cast is feasible, and at run time to ensure
it is correct. So Java is strongly typed, C++ is weakly typed."
Proxy: "So you're saying that if the language is strong everywhere but
it has a single trap door, you're going to call the whole thing weak?"
Wonk: "Totally weak, dude."
Clearly, one of the issues revolves around definitions and how closely you
stick to them. So first we have to decide "what is a type?" Here are two
different ways to define type. An object may be:
- Syntactically compatible: The object provides all the expected
operations (type names, function signatures, interfaces)
Or:
- Semantically compatible: The object's operations all behave in the
expected way (state semantics, logical axioms, proofs)
The first approach is often considered type (it provides a particular
interface) and the second is usually called class (describes
implementation constraints). I think this fits closely with the issue at hand:
we are predominantly concerned with the operations that can be performed upon
objects, and the only semantics of interest are whether those operations are
safe and if the language system consistently reports unsafe operations (e.g. C++
allows old-style unchecked casts and doesn't report anything if you use
them).
- Does the language apply the type constraints to the symbols, or to the
objects?
You apply the type constraints to the symbols especially if you have limited
run-time support (C++). It's possible to do both, as with Java, which constrains
the types and usage of all the symbols at compile time, but also has enough
runtime support to perform limited dynamic checks. If it can't check something
statically (array bounds or null pointers, for example), it checks them
dynamically.
- Are the constraints applied before the program is run (static type checking)
or while the program is running (dynamic type checking)?
And again, Java does some of both. But the dynamically-checked languages seem
to have both more flexibility and at the same time better ability to make sure
an object is properly treated. They can afford to treat the symbols casually
because the objects guard themselves against improper use at runtime, rather
than relying on the compiler to guard them.
- Does the language "guess" the type based on usage (type inference), or does
it know via a declaration somewhere (manifest typing)?
Type inference seems to often get confused with dynamic typing, because the
symbols can appear to have no type (or no fixed type). I have no experience with
true type-inference languages, but they seem rather mind-boggling.
- If a language is strongly typed, does it ever relax the type constraints a
bit in order to allow more flexible programming?
This is my key area of interest in this article. Note that in the dog-robot
example above, many people jump to the conclusion that because the
anything in the speak() argument list is not constrained, that
there is no type safety. However, strong type safety is maintained you
still cannot send an improper message to an object.
Lexicon
- Static typing: Types are checked at compile time
- Dynamic typing: Types are checked at run time
- Strong typing: You can't successfully apply an improper operation
(send a bad message) to an object
- "Weakly typed": you can successfully perform an improper operation on
an object. May or may not be a common confusion with the following term, but
people use it anyway.
- Weak typing a.k.a. Latent typing: type constraints are relaxed in a
few specific cases to make programming more flexible and powerful. (Is "Latent"
actually the right term? It's really about no (named) type at all!)
- Type inference: you don't have to say what the types are, the
language system figures out types based on how they are used (ML family of
languages, currently OCAML)
- Manifest typing: you must specify the types when you create
them
- Duck Typing: A term apparently adopted by Ruby, to describe
latent typing. The initial description confused me into thinking that it meant
the following (which I think is actually a more interesting usage of the phrase):
(a reference to "Duck Tape,"
actually "Duct Tape"), where you can take an object of an existing type and add
new methods to that object. I think we can do something like this in
Python. The value of duck typing seems to be the
ability to create an adapted object without creating a new adapter class (see
Adapter in GoF a.k.a. Design Patterns).
But apparently the Ruby folks simply mean "latent typing" as in the "if it walks like
a duck..." quote further on in this article.
The phrase "Implicit typing" would seem to neatly describe what I'm talking
about here: an implied type is created when you define a function, template, or
generic. Unfortunately, Fortran has already co-opted this term, with its "first
letter of the identifier implies Real or Integer" scheme. I'm tempted to vary it
slightly to "implied typing" since it is so close to the mark, but I'm sure we'd
end up with the same issue as "weakly" vs. "weak." (I've notice a number of
people using the term "implicitly typed" on the web).
la·tent
- Present or potential but not evident or active. In existence but not
manifest. Latent talent.
- Psychology. Present and accessible in the unconscious mind but
not consciously expressed.
A fingerprint that is not apparent to the eye but can be made sufficiently
visible, as by dusting or fuming, for use in identification.
It's there, but not explicit. That's a pretty good word after all. Or:
If you only need something which quacks, don't worry about checking that it's
also a duck.
Andrew Dalke
Of course with latent typing we are not even interested naming it a duck in
the first place. Perhaps this is the root of the problem although I am
not religious, I am at my core a spiritual person and so I believe that the duck
is there even if I see only scant evidence for its existence.
Templates & latent typing vs. Java Generics
Before templates were introduced to C++, you could create generic containers
using macros. For the library designer this was painful, for the client
programmer somewhat less so. When templates were introduced, I explained that if
C++ had a singly-rooted hierarchy (everything ultimately inherits from
Object), these would not be as necessary since the containers could be
written to hold Object, and therefore automatically hold everything. This
is the way Smalltalk had done it and they had seemed to be pretty successful.
Of course Smalltalk used latent typing, and so did not require the casts, but
when Java followed this prescription it had static typing, and so you had the
benefit of containers built with a singly-rooted hierarchy, but now you had to
cast everything up and down all the time.
The new solution in Java/C# is the so-called "Generic," which implies
"anything" but which is actually constrained. This solution is primarily
focused on making collections much more civilized by eliminating casts:
import java.util.*;
class Dog {
public void talk() { System.out.println("Woof!"); }
public void reproduce() { }
}
class Robot {
public void talk() { System.out.println("Click!"); }
public void oilChange() { }
}
public class DogAndRobotCollections {
public static void main(String[] args) {
List<Dog> dogList = new ArrayList<Dog>();
List<Robot> robotList = new ArrayList<Robot>();
for(int i = 0; i < 10; i++)
dogList.add(new Dog());
// dogList.add(new Robot()); // Compile-time error
for(int i = 0; i < 10; i++)
robotList.add(new Robot());
// robotList.add(new Dog()); // Compile-time error
for(Dog d : dogList)
d.talk(); // No cast necessary
for(Robot r: robotList)
r.talk(); // No cast necessary
}
}
This is definitely an improvement for collections and a few other things.
However, you cannot say "I don't care what type this is" for an argument, like
you can in C++. So you cannot say in Java/C#:
class Communicate {
public static <T> void speak(T speaker) {
speaker.talk();
}
}
The only thing you can say is "This thing can be no more specific than
Object." So this compiles, because I call an Object method for it:
public class NothingButObject {
public <T> String f(T anyObject) {
return anyObject.toString();
}
}
This turns out to work just fine when you're dealing with Java collections,
which are already constrained to hold nothing more specific than Object
(so that they can be used to hold any type).
But if you want to say something more general than that, to write true
"generic code," you cannot. To apply the communicate() function to dogs and robots, you have to do this:
interface Speaks { void talk(); }
class Dog implements Speaks {
public void talk() { System.out.println("Woof!"); }
public void reproduce() { }
}
class Robot implements Speaks {
public void talk() { System.out.println("Click!"); }
public void oilChange() { }
}
class Communicate {
public static <T extends Speaks> void speak(T speaker) {
speaker.talk();
}
}
public class DogsAndRobots {
public static void main(String[] args) {
Dog d = new Dog();
Robot r = new Robot();
Communicate.speak(d);
Communicate.speak(r);
}
}
You must constrain the generic type accepted by the function to conform to a
class or interface which contains the operations you are calling inside that
function. But of course this is no better, and actually more confusing, than
simply using the interface alone as the argument:
class Communicate {
public static void speak(Speaks speaker) {
speaker.talk();
}
}
So what I was really hoping for in Java generics was true latent typing:
<T> void f(T objectOfAnyType) {
objectOfAnyType.anyOperation();
}
But what I got was cleanup of templates and a few other features, but none of
the power of true generic types. The primary argument against true generic types
in Java is that "it isn't type safe," which I hope to show is incorrect later in
this paper.
C++ templates always allowed true generic code. Their downfall was that the
compilers have generally produced terrible error messages when you misuse
template code, but many people came to the conclusion that templates were
therefore bad. The problem was that it is apparently messy to figure out at
compile time the function that you tried to call, and that it wasn't there, and
to report it in a simple way.
Another argument against C++ templates is that they cause code bloat because
a copy of the template is laid down each time you instantiate it for a new type.
However, code bloat of this kind generally results from a misunderstanding of the
feature; in particular, putting all your code inside the template instead of inheriting
or using composition. The base class or member objects hold the code that doesn't
need to be templatized (or reproduced).
But the basic idea that I can say "it doesn't matter what type this object
is, as long as I can perform these operations on it" is very powerful. This is
the foundation, for example, of the STL algorithms. Since the algorithms work on
containers, which themselves are designed to be as unconstraining as possible so
they can hold any kind of object, it is as if there's an "implied singly-rooted
hierarchy" in C++ (with a very simple root class).
I believe that Java Generics can be used to build the equivalent of the C++
STL algorithms (many of them, anyway), but that the code may be somewhat
messier. C# seems to follow this same approach with its generics, but I have no
C# version 2 compiler so I can't verify this.
It appears that the next version of C++ (which should appear in 5-8 years if
history is any indicator), will have some kind of template constraint as an
added feature; however, it will be optional, and unconstrained templates will
continue to be available.
Flexibility vs. (Perceived) Safety
Java Wonk: Static type checking is necessary for safety.
Me: But Java does some checking at run-time. That's why you need
ClassCastException.
Wonk: OK, I'll say this then: For safety, as much static type checking
as possible is always desirable.
Me: Isn't it possible to occlude the meaning of the code if you have
too much type-checking ceremony around it? And we still haven't made a dent in
the "20 working lines of code per day per programmer" statistic.
Wonk: You're just complaining about finger typing. I have tools like
Eclipse that will do a lot of typing for me, so I get the best of both worlds:
reduced finger typing and greater type checking.
Me: Yes, but code is read much more than it is written. By making the
reader do more work you're slowing down the development process.
Wonk: On the contrary, this makes the code more explicit and thus
easier to understand.
Me: I find Python code much faster to read precisely because it's
shorter and more straightforward, and thus it seems much easier to understand
than when I'm constantly being interrupted by all the ceremony of Java.
Wonk: I don't know Python so I don't believe you. And I'm positive
that I want the guaranteed safety provided by Java's static type checking.
Me: What do you mean, "guaranteed?" The failure rate of software
development projects is at least 50%, and possibly as high as 80%?
Wonk: That's not verifiable. And Java has improved the success rate of
C++. It's sure made my life easier.
Me: True, few are willing to admit their failures so we can't get an
exact number except to say that it's very high, and very expensive. How can you
argue that this is a sure thing when the failure rates continue to be so high?
It does seem that Java improves the success of projects over C++, but how can
you say that anything is working when everything is mostly failing?
Wonk: But you want to reduce the amount of type checking. That will
make it worse.
Me: No, the amount of type checking is at least the same. It's just
when it happens that changes.
Wonk: If you make it happen at run-time instead of at compile time,
then it's possible that things can slip through the cracks.
Me: Let's look at an example where run-time checking happens in Java.
Pre-JDK 1.5 collections allow you to put any object in, and then you must cast
it back to the desired type when you pull it out. The correctness of the cast is
checked at runtime.
Wonk: Yes, very bad. You can put the wrong type into a collection. I
could put a Dog into a collection of Cat. Hey, I think that's an
example in your book.
Me: Yes, that would be a problem, but you'd get a runtime exception
when it happened, so you'd find out about it.
Wonk: But with static typing like we get with JDK 1.5 Generics it
won't happen at all. That will be a big worry solved.
Me: This happens to you a lot?
Wonk: Well ... no. Actually, I don't remember the last time it
happened. But I'm sure the problem is just lurking, waiting to crash an
important program.
Me: As long as you find out when it happens, you still find out. And
you can do that through unit testing.
Wonk: Static checking is better than unit testing, since I don't have
to write the tests.
Me: You eventually have to write some tests. There's always a
line where it goes beyond language syntax and becomes the semantics of your
particular program. Without your own tests you cannot have a verifiably correct
program.
Wonk: Unit tests are good, yes.
Me: So what I'm suggesting is that by acknowledging that you have to
write your own tests anyway and thus perform dynamic checking of your
application you can consider moving the line a bit and making things a
bit more dynamic. I think the benefit you get in easier-to-write, easier-to-read
coding will far outweigh the loss of a bit of the static type checking. Keep as
much of it as possible, but move a little bit into the runtime arena.
Wonk: Static type checking is too important. It's what Java is about.
More static type checking is always better.
Is Latent Typing Necessary?
In the end we have to ask whether this is an essential feature. What does it
really do?
It allows you to cut across class hierarchies, calling methods that are not
part of a common interface.
Classes only share the methods in their common base class. C++ has no
ultimate Object base class, so the need for latent typing is more
immediately obvious. Java/C# have a singly-rooted hierarchy and
className can only call Object methods in T. They are
"generic," so long as you don't step out of the bounds of what's already generic
(Object). This works fine for collections and any other code which only
needs the fundamental operations of Object.
However, it's still possible to have disjoint hierarchies in Java and C#,
with common methods but no common interfaces. The original authors of two
separate class hierarchies cannot be expected to predict every commonality that
might occur, nor to anticipate that you might want to write a single piece of
code that is applied to both of them.
This is more likely to happen in existing hierarchies, ones that are handed
to you. If you want to perform common operations across multiple disjoint
hierarchies, you will be forced to write duplicated code. If you argue against
this, you are effectively saying "it's OK to duplicate code," although it may
seem like it's just a little bit and not very often.
But what if this is another feature that you haven't been using because you
didn't know about it? Like Objects for a procedural programmer, for example. Or
aspects the new version of JBoss is reportedly being built primarily with
aspects and this seems to have greatly simplified the architecture. C++
programmers benefit by using templates for more than just container classes
(although I will admit that template metaprogramming still seems to me to be
beyond what most people will want to do).
Even if you do argue that all classes can be shoehorned into a hierarchy to
achieve a common interface, there is also the argument of the maintenance
overhead that results from the extra code. As Java programmers we've been like
frogs in saucepans, adjusting to more and more code in order to do basic things.
With the JDK 1.5 "foreach" syntax some of that code has been reduced, and this
is an admirable development (alas, one that may never be duplicated in the
future of this language). But part of my argument about checked exceptions is
that it requires more finger-typing that may not be necessary. The counter-
argument is that we have tools like Eclipse
that make this unnecessary, but the overhead of reading and maintaining the code
which is where most of the money ends up going still increases.
But consider the opposite view. At the SD2004 conference, James Hobart
pointed out the distinction between "probability" and "possibility." In this
case, it is arguably more probable that C++ will have disjoint class hierarchies
than will Java. Java generics seem to work adequately for collections, and
although the possibility that you may encounter disjoint class hierarchies that
you'd like a single piece of code to cut across, perhaps the probability is not
that high that it will happen very often in practice. This is what Hobart called
"going down the happy path." In Java, we have no choice but to take this path,
but in other languages we may see through experience what is lost by doing
so.
We're really talking about testing
In the end, these are all just different kinds of tests. We can consider them
as a spectrum:
- Testing by the compiler (static type checking)
- Testing by the run-time system (dynamic type checking + more)
- Unit testing of your classes (D at www.DigitalMars.com is the only language
with keyword support for this)
- Conformance testing of your program (very little standardization in this
area)
The compiler and run-time system can only know so much. There can only be a
certain set of tests that are general enough that they can be applied to every
program. At some point you must begin writing your own tests in order to test
the semantics of your particular program, if you want to ensure its correctness.
What we are talking about, then, is not whether tests are critical (they
are), but when and how these tests take place. More importantly, about balancing
the expressiveness and clarity of a language with the place in the spectrum
where these tests occur. My opinion: Programs are prose, and must be readable.
Static checking is a good thing if it doesn't impact the clarity and
expressiveness of the language. If clarity and expressiveness are impacted, then
the checking should be moved to reduce that impact.
This opinion is based on my belief and experience that you cannot rely on
static checking to verify program correctness, and therefore it is unreasonable
to obscure the meaning of your program with high-ceremony syntax under the
illusion that it is somehow safer. For full safety, you'll always have to have
run-time tests, and you'll always have to write your own tests. Given this, to
me it makes no sense to make your life harder by obscuring the code because of
this illusion.
Determinism
Albert Einstein was very disturbed by the Heisenberg uncertainty principle
(you can't know both position and momentum of small particles). He believed that
"God does not play dice with the universe" and postulated a "hidden variable"
theory that, when discovered, would allow us to completely determine everything.
Bell's theorem showed that there is no hidden variable theory.
Steven Hawking describes how you can have higher determinism nearer to an
event (in space-time) than further away. The effect of the uncertainty principle
is cumulative in space-time.
The Greeks believed that there was fate and free will together, which people
have always found confusing, since we demand one or the other exclusively.
Perhaps Hawking's approach explains the dichotomy.
It may be possible to create programs that are provably correct, but my guess
is that these programs will fill only a small fraction of the general solution
space. Static type checking is only one kind of testing, and very limited in its
effect as you move out in space time from the point of compilation, the
"software uncertainty principle" adds more and more noise to the results. I
think that only by testing throughout will you get a program that is as correct
as you are able to describe it.
Some notes I collected while researching this article:
"Beware object-oriented textbooks! Polymorphism does not refer to the dynamic
behaviour of objects aliased by a common superclass variable, but to the fact
that variables may hold values of more than one type in the first place. This
fact is independent of static or dynamic binding."
Anthony J. H. Simons: "The Theory of Classification, Part 1: Perspectives on
Type Compatibility", in Journal of Object Technology, vol. 1, no. 1, May-June
2002, pp. 55-61. http://www.jot.fm/issues/issue_2002_05/column5
Directing vs. Enabling
http://martinfowler.com/bliki/SoftwareDevelopmentAttitude.html
Smalltalk: "Smalltalk developers usually refer to an interface indirectly by
saying that classes A, B, and C are "polymorphically compatible", meaning those
classes understand a certain group of messages."
http://www.jot.fm/issues/issue_2002_05/article1