char* ptr;
char[] = ptr[0..10];
- for one dimension, it would be a trivial extension. How about higher
dimensions?Currently, the D programming language has no support for rectangular multidimensional arrays of dynamic (i.e. runtime-calculated) size. Multidimensional dynamic arrays are only supported as jagged arrays, adding only range-information to what C already has.
For arrays of compile-time-calculated length, the language specification mentions, that the data is stored in a rectangular manner. Anyhow, the syntax for these nested static arrays holds a dangerous pitfall (see below) that makes true multidimensional static arrays desirable at least as some syntactic sugar.
This document describes a proposal for a language extension/change for D that would allow extremely powerful multidimensional arrays built into the language.
The proposal probably is far from complete. If you have corrections, comments or ideas, please contact me directly: Norbert@Nemec-online.de or join the discussion on the D language newsgroup.
The indexing expression p[i] allows unchecked, one-dimensional array handling as known from C.
Here, the data is stored in place (i.e. on stack, or directly in struct/object data) and copied as value on assignment. mytype[3] is a different type from mytype[2], sizeof(mytype[2]) == 2*sizeof(mytype).
Formerly the only type of dynamic arrays: basically a pointer that carries additional range information. "unstrided" is used for distinction from "strided" arrays to be introduced lateron.
not covered in this document
Note: Unstrided array references are, from the language point of view, not strictly necessary, as they could just be seen as special case of the strided array type to be introduced later. Yet, unstrided arrays are still kept once for backward compatibility, but even more important, for performance reasons. One-dimensional arrays are especially used for strings, where striding is hardly ever used. Even though the compiler might be able to optimize away some of the overhead of unnecessary striding information, it will only ever do so incompletely, which would lead to a performance loss especially in string handling.
Note: The name "dynamic array" should be used with care.
Personally, I prefer to call them "array references" as it expresses more
clearly what it is and how it behaves. There is a clear distinction between
the reference, which also carries the length information, and the data, which
is placed somewhere in memory. As with references in general, there may be
several pointing to the same memory.
The recommended style of string handling in D is "copy-on-write" (COW),
that is, if you want to write to a string that you do not "own", you should
make a copy of it first.
For general array references in other contexts, it may be very useful to keep
several writable references to the same memory, for example, passing a
reference to a routine, that should then modify the referenced data in some way.
One other reason against the name "dynamic array" might be, that, of
course, the size of the data allocated on the heap cannot be truly
dynamic. You have to know that size at the moment of allocation, and
any upsizing beyond this size can only be done by allocating some
bigger space and copying the data. Even though this happens
automatically under the guise of powerful language features, it still is
important for the user to be aware of this internal realization of
"dynamic arrays".
Before going on to the new rectangular arrays, a few words about nested arrays.
In general, all the different array styles can be nested in any order:
mytype[3][][char[]]* m;
would be "a pointer to an associative array of dynamic arrays of static arrays
of mytype". The danger now lies in dereferencing the elements, which
has to happen in reverse order with respect to the type declaration:
mytype x = (*m)["something"][27][2];
This is of course also true for nested static arrays, that were formerly
propagated as rectangular arrays. Correct code would look like:
const int A=2;
const int B=2;
mytype[B][A] M;
for(int a=0;a<A;a++)
for(int b=0;b<B;b++)
M[a][b] = new mytype();
Be aware of the difference between the type declaration mytype[B][A]
and the dereferenciation M[a][b].
This dangerous pitfall is inherent in the syntax using both postfix type modifiers and postfix dereferenciation operators. It is important to be aware of it and should already be reason enough to prefer true rectangular arrays over arrays where this makes sense.
These arrays actually add very little to the current specification of the D language. The static array type modifier simply is extended to have more than one range specifier, separated by commas. The corresponding indexing expression has to have the same number of indices:
const int A=2;
const int B=2;
mytype[A,B] M;
for(int a=0;a<A;a++)
for(int b=0;b<B;b++)
M[a,b] = new mytype();
Even though mytype[B][A] is internally the same as
mytype[A,B], they are different types. Implicit casting happens in
both directions.
Note: I considered making mytype[A,B] simply syntactic sugar for mytype[B][A], but that would cause difficulties with the indexing operator: M[a,b] would have to be syntactic sugar as well, which would clash badly with the syntax for slicing. The simple rule is now: N-dimensional arrays have to be indexed or sliced with N-dimensional indexing/slicing expressions.
To support truly rectangular arrays with a size calculated at runtime, rectangular array references are newly introduced in this proposal.
Note: "Array reference" is just another term for "dynamic array". I prefer the former term, as it makes intuitively clear that the object you are handling has reference semantics.
Every array reference has a fixed number of dimensions, which has to be known at compile time. The array reference itself consist of a pointer to the data, along with range and stride information for each dimension.
Zero-dimensional array references are legal. They are effectively a reference to exactly one element, and are implicitly casted to the underlying type. See: Zero-dimensional array references
mytype[[2]] A2;
mytype[[3]] A3;
Note: Former proposals used something like mytype[,] and mytype[,,]. The problem with this syntax is, that the dimensionality cannot be calculated at compile time (like from a template parameter). Furthermore mytype[] is already used for unstrided array references, which is something different from mytype[[1]].
mytype[[2]] A2; // variable declaration
A2 = new mytype[3,4]; // allocation of memory on the heap
A2[2,1] = 3; // assignment to array elements
mytype tmp = A2[2,1]; // reading of array elements
mytype[[2]] B2 = A2; // reference assignment. Both arrays now point to the
// same memory
assert(B2[2,1] == A2[[2,1]]);
Note: Even though, a language specification should usually not talk too much about internal representation, it makes sense to define it at this place, since it is needed to really understand the concept of multidimensional array references.
The internal representation of a multidimensional array reference mytype[[N]] could be written as a structure:
struct mytype(int N) {
mytype *data;
size_t[N] range;
ptrdiff_t[N] stride;
}
here, data always points to the [0,0] element of the array.
Note: When designing a concept for multidimensional arrays,
several other possibilities spring to mind:
Between these three, I decided in favor of the most powerful option. The size
of an array reference should not really matter too much, since a program would
usually not store many array references but rather few references to large
arrays. In terms of speed, the first two options do not show any difference.
The third option may add a little runtime overhead, but the compiler should be
able to optimize this away. In any case, the runtime overhead between the
second and the third option is no big enough to justify the language complexity
of two different types. For one dimensional arrays, the special unstrided array
is offered, since char[] is used for strings where striding really is not very
common.
Note: Arrays generally start at index 0, as known from C. Variable offsets, as some languages offer them, mean only additional runtime overhead, possible causes for confusion and questionable benefit and are therefore not supported.
In its full generality, this internal representation would, of course, allow all kinds of weird shapes and self-overlappings. An array reference is called:
1 <= abs(stride[i[0]]) abs(stride[i[0]])*range[i[0]] <= abs(stride[i[1]]) abs(stride[i[1]])*range[i[1]] <= abs(stride[i[2]]) ... abs(stride[i[N-2]])*range[i[N-2]] <= abs(stride[i[N-1]])
1 = abs(stride[i[0]]) abs(stride[i[0]])*range[i[0]] = abs(stride[i[1]]) abs(stride[i[1]])*range[i[1]] = abs(stride[i[2]]) ... abs(stride[i[N-2]])*range[i[N-2]] = abs(stride[i[N-1]])
1 = abs(stride[0]) abs(stride[0])*range[0] = abs(stride[1]) abs(stride[1])*range[1] = abs(stride[2]) ... abs(stride[N-2])*range[N-2] = abs(stride[N-1])
1 = abs(stride[N-1]) abs(stride[N-1])*range[N-1] = abs(stride[N-2]) abs(stride[N-2])*range[N-2] = abs(stride[N-2]) ... abs(stride[1])*range[1] = abs(stride[0])
Note:As can be seen above, the default alignment is Fortran style alignment. The reason behind this decision was, that the main application for multidimensional arrays is numerics, where interfacing Fortran code is more important in praxis than interfacing C code. Aside from interfacing with other languages, the decision is completely arbitrary, since the design described in this document handles both alignments equally well and allows easy and clean conversion between both.
To allocate multidimensional arrays on the heap, there are two different new expressions. The range of the individual dimensions can either be given directly or as an array:
mytype[[3]] A = new mytype[3,4,5];
int[3] tmpidx = [3,4,5];
mytype[[3]] A = new mytype[tmpidx];
Both have the same outcome, the latter is useful, when the dimensionality is
calculated at compile-time (like within templates).
Note: Once we have array literals, the former expression might actually just become syntactic sugar for the latter.
An array created by new is guaranteed to be aligned.
To allow user defined types using the multidimensional indexing syntax, the OpIndex routine is extended to take an arbitrary number of arguments as indices. Alternatively, the definition may use a single, static int[N]-array as argument. It is illegal to have both, OpIndex(idx0,idx1,...) and OpIndex(int[N] idx) with the same number of indices in the same class/struct.
Since the old syntax for the indexed assignment overloading distinguished OpIndex(idx) and OpIndex(idx,value) by the number of arguments alone, the latter is replaced by OpIndexAssign(idx0,idx1,..,value), or alternatively OpIndexAssign(int[N] idx,value).
Sorry, but I have not the slightest idea how to fit these into a reasonable syntax, especially, when it comes to partial slicing.
When assigning to an array reference, there are two possibilities:
The only special case to this rule: if the lefthand side of an assignment is an indexing expression, the compiler first checks for a possible OpIndexAssign routine. If that does not exist, the compiler tries to find a OpIndex routine, and, if that does return an array reference, the assignment is read as an array assignment.
Slicing is already known from old-style onedimensional, unstrided dynamic arrays:
char[] str = "0123456789";
char[] substr = str[6..9]; // nothing new here
assert(substr == "678");
Note: There is a proposed language extensions that would allow dropping range boundaries or using a new $ symbol to designate the range of the corresponding dimension for calculations. That proposal is fully orthogonal to this document. It would be very nice to have, but there is no need to discuss it at this place.
First extension to this is the possibility for strided slicing:
char[[1]] str = "0123456789";
// char[] is implicitly converted to char[[1]]
char[[1]] substr = str[1..8:4];
assert(substr == "15");
Here, the stride has to be positive and the lower bound may not be larger the
the upper bound. When the slice [mn..mx:sd] is taken, the range rg of
the new reference is chosen as the largest integer such that (rg-1)*sd+1 <= mx-mn
Now, this can - of course - be done on several dimensions as well:
mytype[[2]] arr = new str[4,5];
mytype[[2]] subarr = arr[1..4:2,2..5:2];
resulting in a 2x2 array reference. The data itself is not touched. The new
reference points into the same location on the heap where the original array
was created.
For array references of any dimensionality, the postfix operator [] means the full slice of the array. The only effect of this operator is, to turn a lvalue reference into a rvalue reference, which is necessary for array assignments.
While for one-dimensional arrays indexing and slicing are separate concepts, they can be merged for multidimensional arrays. Every index given exactly reduces the dimensionality of the resulting array by one:
mytype[[2]] arr = new str[4,5];
mytype[[1]] subarr = arr[,2];
mytype[[1]] subarr2 = arr[1..4,2]; // slicing and partial indexing in one step
mytype[[0]] subarr3 = arr[1,2]; // see below
mytype value = arr[1,2];
The last two cases make clear that a fully indexed array actually returns a
zero-dimensional array reference which is then implicitly casted to the
underlying type if needed.
A negative stride in an array reference means that the data is referenced in reverse order. This makes it possible to reverse individual dimensions of an array without touching the data in the array.
To create such a reverse slice, simply use a negative value for the stride in a slicing operation. This will result in a array reference pointing to the same set of entries, with the indexing running in opposite order:
char[] str = "0123456789";
char[[1]] substr = str[1..8:-4];
assert(substr == "51");
The internal pointer will, of course, again point to the [0] element of the
newly created array reference.
Stride 0 in is nonsense and will therefore result in a compiletime or runtime error.
Transposing of arrays means interchanging two dimensions. With array references this is also easily possible without touching the data. The property .transpose(int dimA,int dimB) returns a new array reference with just the stride and the range of the two given dimensions, resulting in a transposed view of the referenced data.
The property .transpose() without arguments will return a completely transposed array reference, i.e. the order of dimensions is reversed.
It might not be used often, but it is simple to implement, and it is just too cool to ignore: simply replacing two dimensions by one whose stride is the sum of the two original strides gives you a reference for the diagonal of the original array! The range of this new dimension is the minimum of the original ranges. The property .diag(int dimA,int dimB) returns such a reference. dimA is replaced by the new diagonal dimension, dimB is dropped, all dimensions higher than dimB are shifted by one. The range of the new diagonal dimension is set to the minimum of the ranges of the two contracted dimensions.
The property .diag sums up all strides and returns a one-dimensional array reference corresponding to the total diagonal of the original array. Again, the range of the new array is the minimum of all original ranges.
The concept of strides even allows slicing into structs and complex numbers that are stored in an array:
struct mystruct {
mytype1 m1;
mytype2 m2;
}
mystruct[[2]] A = new mystruct[2,2];
mytype2[[2]] A_m2 = A.m2;
assert(&A[0,1].m2 == &A_m2[0,1]);
complex[[2]] B = new complex[2,2];
real[[2]] B_re = B.re;
assert(&B[1,0].re == &B_re[1,0]);
Note: I'm not yet sure about the syntax here, but the concept certainly is extremely powerful.
There are different reasons why data might need to be cloned. Depending on the different needs, there are different versions of the .dupXXX property:
For each version, an extended form exists, that takes ranges of the new array as additional arguments: .dup(int r0,int r1,...) The resulting array is then truncated or filled up with the default values of the underlying type. Again, the compiler may optimize away the cloning if it can guarantee the correct alignment or the uniqueness in place. Like for the new expression, .dupXXX(int[N] r) is supported as well with identical meaning.
For all the different types of arrays, there is a number of implicit casts performed wherever this makes sense:
When applied on static arrays, the &-operator returns an array reference. For one-dimensional static arrays, it is lightweight, otherwise multidimensional.
Note: One could extend this rule to arbitrary types, where & could simply return a zero-dimensional array reference, which is then implicitly casted to a raw pointer if needed. (See below, in the chapter on zero-dimensional arrays.
All these properties are side-effect-free. They do not touch the data that was referenced. The input-reference is not changed, but rather a new reference returned. (Except, of course, for assignments to .length, etc.)
Note: The former properties .reverse and .sort are dropped. Both had side effects on the underlying data, making them stand out in a confusing way. Furthermore .reverse could be confused with reverse slicing, which does not touch the underlying data. Both operations should rather be implemented in the standard library.
The hidden power of zero-dimensional array references has been mentioned several times in this document. When thinking about this for the first time, I mainly viewed it as a rather superfluous feature that should only be introduced for completeness sake, to make it easier to write generic templates. Only on the second thought, I realized what a powerful feature this might become:
A zero-dimensional array reference basically is just a reference to one single item in memory. As mentioned in the section about partial indexing, a indexing expression always returns an array reference of reduced dimensionality. A full indexing expression therefore has to return a zero-dimensional array reference. Since, usually, one is interested not in the reference but in the value, such zero-dimensional array references have to be implicitly casted to the underlying type.
With this implicit casting, zero-dimensional array references have most of the capabilities of C++-style references, in that they actually are references, but can be used with the same syntax as values.
Looking back at the rules for assignments, it is also clear what happens when a zero-dimensional array reference is assigned to: Either it is a lvalue or it is a rvalue. In the first case, you get a reference assignment, in the second case, you get an array assignment which actually is a value assignment, since there is just one item referenced.
This topic has yet to be sorted out, but it should extend fairly straightforward from one-dimensional arrays.