// Copyright 2018 The Bazel Authors. All rights reserved. // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. // You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. package com.google.devtools.build.lib.skyframe.serialization; import com.google.common.annotations.VisibleForTesting; import com.google.common.base.Preconditions; import com.google.protobuf.CodedInputStream; import com.google.protobuf.CodedOutputStream; import java.io.IOException; import java.util.HashMap; import java.util.IdentityHashMap; import javax.annotation.Nullable; /** * A framework for serializing and deserializing with memo tables. Memoization is useful both for * performance and, in the case of cyclic data structures, to help avoid infinite recursion. * *
Usage: To add support for a new type of value, create a class that implements {@link * MemoizingCodec}. To serialize, instantiate a new {@link Serializer} and call its {@link * Serializer#serialize serialize} method with the value and a codec for that value; and similarly * to deserialize, instantiate a {@link Deserializer} and call {@link Deserializer#deserialize * deserialize}. The memo table lives for the lifetime of the {@code Serializer}/{@code * Deserializer} instance. Do not call {@link MemoizingCodec#serializePayload} and {@link * MemoizingCodec#deserializePayload} directly, since that will bypass the memoization logic for the * top-level value to be encoded or decoded. * *
The memo table associates each value with an integer identifier. * *
Cyclic data structures can occur either naturally in complex types, or as a result of a * pathological Skylark program. An example of the former is how a user-defined Skylark function * implicitly refers to its global frame, which in turn refers to the functions defined in that * frame. An example of the latter is a list that the user mutates to contain itself as an element. * Such pathological values in Skylark are technically allowed, but they are not useful since * Skylark prohibits recursive function calls. They can also expose implementation bugs in code that * is not expecting them (b/30310522). * *
Ideally, to handle cyclic data structures, the serializer should add the value to the memo * table before actually performing the serialization. For example, to handle the recursive * list {@code L = ["A", L]}, the serializer should do the following: * *
However, this strategy does not work for all types of values. There is no way to deserialize * the recursive tuple {@code T = ("A", T)}, because tuples are immutable and therefore cannot be * instantiated before all of their elements have been. Rather than restrict serialization to only * mutable types, or add a special way for deserializers to modify seemingly immutable types, we * simply don't memoize immutable types until after they are fully constructed. This means that * {@code T} is not serializable. But that's okay, because objects like {@code T} should not even be * able to exist (barring a hidden API or reflection). In general, all cycles must go through at * least one mutable type of value. * *
Aside from mutability, there is another potential problem: One of the types' constructors may * enforce an invariant that is not satisfied at the time the constructor is invoked, even though it * may be satisfied once construction of all objects is complete. For instance, suppose a type * {@code Foo} has a constructor that takes a non-empty list. Then we would fail to deserialize * {@code L = [Foo(L)]}, since {@code L} is initially empty. Such a list could legally be formed by * putting other elements in {@code L} before creating {@code Foo}, and then later removing those * other elements. But there's no general way for a deserializer to know how to do that. Therefore, * it is the caller's responsibility to ensure the following property: * *
* * For any value that is to be serialized, if the value has children that directly or indirectly * contain the value, then the value must be constructible even when those children are in a * semi-constructed state. * ** * where "semi-constructed state" means any state that can be produced by the codecs for those * children. Other serialization systems address this issue by providing multiple hooks for types to * setup their invariants, but we keep the API relatively simple. * *
Round-tripping a value through memoized serialization and deserialization is guaranteed to * preserve the object graph, i.e., to not duplicate a value. For mutable types, a value can only be * serialized and deserialized at most once because it is memoized before recursing over its * children. For immutable types, although a value can be serialized multiple times, upon * deserialization only one copy is retained in the memo table. This is conceptually similar to how * Python's Pickle library * handles tuples, although in that case they use an abstract machine whereas we do not. * *
Wire format, as an abstract grammar: * *
{@code * START --> NoMemoContent | `NEW_VALUE` MemoContent | `BACKREF` MemoId * MemoContent --> MemoBeforeContent | MemoAfterContent * NoMemoContent --> Payload * MemoBeforeContent --> MemoId Payload * MemoAfterContent --> Payload MemoId * MemoId --> int32 * }* * where {@code Payload} is the serialized representation of the value. {@code Payload} may itself * contain complete memo-aware encodings of the value's children. */ // TODO(brandjon): Maybe make this more robust against a pathological cycle of immutable objects, so // that instead of failing with a stack overflow, we detect the cycle and throw // SerializationException. This requires just a little extra memo tracking for the MEMOIZE_AFTER // case. public class Memoizer { /** * Constants used in the wire format to signal whether the next bytes are content or a * backreference. */ private enum MemoEntry { NEW_VALUE, BACKREF } private static ObjectCodec
Codecs with this strategy will always serialize payloads, never backreferences, even if * the same value has been serialized before. This does not apply to other codecs that are * delegated to via {@link Serializer#serialize}. Deserialization behaves * analogously. * *
This strategy is useful for codecs that write very little data themselves, but that still * delegate to other codecs. */ DO_NOT_MEMOIZE, /** * Indicates that the value is memoized before recursing to its children, so that it is * available to form cyclic references from its children. If this strategy is used, {@link * MemoizingCodec#makeInitialValue} must be overridden. * *
This should be used for all types where it is feasible to provide an initial value. Any * cycle that does not go through at least one {@code MEMOIZE_BEFORE} type of value (e.g., a * pathological self-referential tuple) is unserializable. */ MEMOIZE_BEFORE, /** * Indicates that the value is memoized after recursing to its children, so that it cannot be * referred to until after it has been constructed (regardless of whether its children are still * under construction). * *
This is typically used for immutable types, since they cannot be created by mutating an * initial value. */ MEMOIZE_AFTER } /** * A codec that knows how to serialize/deserialize {@code T} using a memo strategy. * *
Codecs do not interact with memo tables directly; rather, they just define a memo strategy * and let {@link Serializer} and {@link Deserializer} take care of the memo logic and metadata. A * codec can dispatch to a subcodec to handle components of {@code T} compositionally; this * dispatching occurs via the {@code Serializer} or {@code Deserializer} so that the memo table is * consulted. * *
Codecs may access additional data from the deserialization context. For instance, Skylark
* codecs may retrieve a {@link com.google.devtools.build.lib.syntax.Mutability} object to
* associate with the value they construct. The type of this additional data is checked
* dynamically, and a mistyping will result in a {@link SerializationException}.
*/
// It is technically possible to work the type of the additional data into the generic type of
// MemoizingCodec and Deserializer. But it makes the types ugly, and probably isn't worth the
// damage to readability.
public interface MemoizingCodec If set to {@link Strategy#MEMOIZE_BEFORE}, then {@link #makeInitialValue} must be
* implemented.
*
* Implementations of this method should just return a constant, since the choice of strategy
* is usually intrinsic to {@code T}.
*/
Strategy getStrategy();
/**
* If {@link #getStrategy} returns {@link Strategy#MEMOIZE_BEFORE}, then this method is called
* to obtain a new initial / empty instance of {@code T} for deserialization. The returned
* instance will be populated by {@link #deserializePayload}.
*
* If any other strategy is used, this method is ignored. The default implementation throws
* {@link UnsupportedOperationException}.
*
* The passed in deserialization context can provide additional data that is used to help
* initialize the value. For instance, it may contain a Skylark {@link
* com.google.devtools.build.lib.syntax.Mutability}.
*/
default T makeInitialValue(Deserializer deserializer) throws SerializationException {
throw new UnsupportedOperationException(
"No initial value was provided for codec class " + getClass().getName());
}
/**
* Serializes an object's payload.
*
* To delegate to a subcodec, an implementation should call {@code serializer}'s {@link
* Serializer#serialize serialize} method with the appropriate piece of {@code obj} and the
* subcodec. Do not call the subcodec's {@code serializePayload} method directly, as that would
* bypass the memo table.
*/
void serializePayload(
SerializationContext context,
T obj,
CodedOutputStream codedOut,
Serializer serializer)
throws SerializationException, IOException;
/**
* Deserializes an object's payload.
*
* To delegate to a subcodec, an implementation should call {@code deserializer}'s {@link
* Deserializer#deserialize deserialize} method with the appropriate subcodec. Do not call the
* subcodec's {@code deserializePayload} method directly, as that would bypass the memo table.
*
* If this codec uses the {@link Strategy#MEMOIZE_BEFORE} strategy (as determined by {@link
* #getStrategy}), then the {@code initial} parameter will be a new value of {@code T} that can
* be mutated into the result. In that case, this function must return {@code initial}. If any
* other strategy is used, then {@code initial} will be null and this function must instantiate
* the value to return.
*/
T deserializePayload(
DeserializationContext context,
@Nullable T initial,
CodedInputStream codedIn,
Deserializer deserializer)
throws SerializationException, IOException;
}
/** A context for serializing; wraps a memo table. */
public static class Serializer {
private final SerializingMemoTable memo = new SerializingMemoTable();
public