Add support for dfdl:lengthKind="endOfParent"#1652
Conversation
- Implemented logic to handle elements with `dfdl:lengthKind="endOfParent"`. - Added validations to enforce schema/spec constraints and error conditions specific to `endOfParent` lengthKind. - Introduced determination of effective length units for parent elements and related error checks. - removed NYI Errors - add tests for endOfParent elements with different LengthKinds incl nested EndOfParent DAFFODIL-238
332415b to
7e7bd04
Compare
dfdl:lengthKind="endOfParent"dfdl:lengthKind="endOfParent"
stevedlawrence
left a comment
There was a problem hiding this comment.
Looks great, surprise how much we kindof already have support for this with some tweaks.
| case e: ElementBase => Some(e) | ||
| case ge: GlobalElementDecl => Some(ge.asRoot) | ||
| case s: SequenceTermBase => s.immediatelyEnclosingElementParent | ||
| case c: ChoiceTermBase => c.immediatelyEnclosingElementParent |
There was a problem hiding this comment.
Do we need to return the choice in some cases? It looks like the logic for EOP sometimes cares about the choice so I'm not sure we can bypass this?
| </tdml:dfdlInfoset> | ||
| </tdml:infoset> | ||
| </tdml:parserTestCase> | ||
|
|
There was a problem hiding this comment.
Some lines from the spec that I'm not sure I saw tests for:
The dfdl:lengthKind 'endOfParent' can also be used on the document root to allow the last element to consume the data up to the end of the data stream.
I assume this means something like:
<element name="root" dfdl:lengthKind="endOfParent">
<complexType>
<sequence>
<element name="field1" ... />
<element name="rest" dfdl:lenghtKind="endOfParent" ... />
</sequence>
</complexType>
</element>So rest should contain everything up until the end of the data. That doesn't really work will with the assuming that we'll always have a bit limit, since the root element won't every set bitLimit in this case.
A simple element must have one of [...] dfdl:representation 'binary' and a packed decimal representation
I don't think there are any tests for packed binary formats, and it looks like there aren't any modifications to the code to support packed types
A simple element must have one of [...] dfdl:representation 'text'
I think this implies that you can have simple types with bool/numbers/dates/times/etc as long as representation is text. I think these should all work because they just add a converter to a specified length string parser, but we should have tests for them.
The dfdl:lengthKind 'endOfParent' means that the element is terminated [...] or the end of an enclosing choice with dfdl:choiceLengthKind ‘explicit’.
choiceLengthKind="explicit" isn't used very often, but we should probably have a couple tests to make sure this works with EOP children.
| notYetImplemented("lengthKind='endOfParent' for complex type") | ||
| case LengthKind.EndOfParent => | ||
| notYetImplemented("lengthKind='endOfParent' for simple type") | ||
| case LengthKind.EndOfParent => new SpecifiedLengthEndOfParent(this, body) |
There was a problem hiding this comment.
I think this is is only needed for complex types? My thinking is that simple types with lengthKind EOP should have parent parser (ether complex or explicity length choice) that already set the bit limit via one of these specified length parsers. So the bit limit has already been set correctly and we don't need another parser to do that.
But this is needed for complex types with EOP since they need to skip any bits up to their parents bit limit that thier children might not have consumed.
So I think this wants to be:
case LengthKind.EndOfParent if isComplexType => new SpecifiedLengthEndOfParent(this, body)And then bodyRequiresSpecifiedLength wants to be modified to make it so it evaluates to false if this is a simple type with lengthKind EOP.
There was a problem hiding this comment.
Hmm what if the simpleType is a root element with lengthKind EOP (where the user intends it to go to the end of the datastream)?
There was a problem hiding this comment.
Good point. So maybe it instead wants to be something like this?
case LengthKind.EndOfParent => {
Assert.invariant(isComplexType || isRoot)
new SpecifiedLengthEndOfParent(this, body)
}| } | ||
| case None if this.isInstanceOf[Root] => LengthUnits.Characters | ||
| case _ => | ||
| Assert.invariantFailed( |
There was a problem hiding this comment.
I feel like this invariant might break if we have something like a global element decl with a child with EOP. That EOP will want to reach up to find where it's used but wont' be able to find a parent because it only looks lexically.
There was a problem hiding this comment.
Do you mean a child that's an element reference? Do we have any other way to look at a parent that's not optLexicalParent? Even immediatelyEnclosingGroupDef uses optLexicalParent.
There was a problem hiding this comment.
Added the below test and it works as expected where LK is EOP
<xs:element name="text_string_txt_bytes" type="xs:string" dfdl:lengthUnits="bytes" nillable="true" />
<xs:element name="text_string_txt_ref2">
<xs:complexType>
<xs:sequence>
<xs:element ref="ex:text_string_txt_bytes"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="text_string_txt_ref3">
<xs:complexType>
<xs:sequence>
<xs:element ref="ex:text_string_txt_ref2"/>
</xs:sequence>
</xs:complexType>
</xs:element>
There was a problem hiding this comment.
I might I'm just forgetting how optLexicalParent works.
Refreshing my memory, it looks like the GlobalElementDecl DSOM object that represents the global definition does not have an optLexicalParent (or I guess more correctly it does, but it is the SchemaDocument).
But we also have an "ElementRef" DSOM object that represents the local element ref to that global definition. And the "ElementRef" is what is in the DSOM tree.
So as long as these functions are run within the context of ElementRef then maybe these invariants hold.
But I think maybe things get tricky if we try to recursively look up multiple parents though?
For example, if we are in the context of the ElementRef(text_string_txt_bytes) and ask for its optLexicalParent we'll get the global GlobalElemenDecl(text_string_txt_ref2). But if we then recursively ask for that GlobalElementDecl's optLexicalParent we'll get a SchemaDocument.
Similarly, say we have a group like this:
<group name="foo">
<sequence>
<element name="someEopElement" ... />
</sequence>
</group>Asking for the optLexicalParent of someEopElement returns the GlobalSequenceGroupDef for foo, who's optLexicalParent is the SchemaDocument and not whatever references the group.
And that makes sense because multiple different things could reference the group, so we don't really know which parent to examine.
So I'm not really sure how recursively looking up parents can work. The recrusion essentially ends at the global definition. It works fine if the thing you are looking for is within the scope of of your global element (e.g. Element Ref is always inside an Element, but I think it breaks once groups get involved. Unless those are handelded somehow else, and maybe I'm just looking at the wrongs spot when I'm inspeting values of optLexicalparent?
| case _ => // do nothing | ||
| } | ||
| schemaDefinitionWhen( | ||
| representation == Representation.Text && knownEncodingWidthInBits != 8 && parentEffectiveLengthUnits != LengthUnits.Characters, |
There was a problem hiding this comment.
I did some testing and change the CVS schema to this:
<element name="file" dfdl:lengthKind="explicit" dfdl:length="10" dfdl:terminator="%NL;">
<complexType>
<sequence>
<element name="field" type="xs:string" dfdl:lengthKind="endOfParent" />
</sequence>
</complexType>
</element>
And a got this stack trace:
org.apache.daffodil.lib.exceptions.Abort: Invariant broken: KnownEncodingMixin.this.isKnownEncoding
org.apache.daffodil.lib.exceptions.Assert$.abort(Assert.scala:153)
org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingName(EncodingRuntimeData.scala:56)
org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingName$(EncodingRuntimeData.scala:46)
org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingName$lzyINIT1(LocalElementDecl.scala:25)
at org.apache.daffodil.lib.exceptions.Assert$.abort(Assert.scala:153)
at org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingName(EncodingRuntimeData.scala:56)
at org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingName$(EncodingRuntimeData.scala:46)
at org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingName$lzyINIT1(LocalElementDecl.scala:25)
at org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingName(LocalElementDecl.scala:25)
at org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingCharset(EncodingRuntimeData.scala:62)
at org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingCharset$(EncodingRuntimeData.scala:46)
at org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingCharset$lzyINIT1(LocalElementDecl.scala:25)
at org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingCharset(LocalElementDecl.scala:25)
at org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingWidthInBits(EncodingRuntimeData.scala:81)
at org.apache.daffodil.runtime1.processors.KnownEncodingMixin.knownEncodingWidthInBits$(EncodingRuntimeData.scala:46)
at org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingWidthInBits$lzyINIT1(LocalElementDecl.scala:25)
at org.apache.daffodil.core.dsom.LocalElementDeclBase.knownEncodingWidthInBits(LocalElementDecl.scala:25)
at org.apache.daffodil.core.grammar.ElementBaseGrammarMixin.checkEndOfParentElem(ElementBaseGrammarMixin.scala:325)
at org.apache.daffodil.core.grammar.ElementBaseGrammarMixin.checkEndOfParentElem$(ElementBaseGrammarMixin.scala:49)
at org.apache.daffodil.core.dsom.LocalElementDeclBase.checkEndOfParentElem$lzyINIT1(LocalElementDecl.scala:25)
at org.apache.daffodil.core.dsom.LocalElementDeclBase.checkEndOfParentElem(LocalElementDecl.scala:25)
I think the issue is that the default encoding used by csv is UTF-8, which seems to cause problems here.
There was a problem hiding this comment.
I'm not able to replicate this. And I think CSV's default encoding is ASCII?
There was a problem hiding this comment.
It looks like it uses the $dfdl:encoding variable which I think defaults to UTF-8
https://github.com/DFDLSchemas/CSV/blob/master/src/csv-base-format.dfdl.xsd#L48
That's a relatively new change to CSV, maybe you're using an older version?
There was a problem hiding this comment.
Looks like the issue is specifically due to the use of dfdl:encoding var..replacing with UTF-8 works as expected. Will investigate
<xs:element name="file" dfdl:lengthKind="explicit" dfdl:length="10" dfdl:terminator="%NL;" dfdl:encoding="{$dfdl:encoding}">
<xs:complexType>
<xs:sequence>
<xs:element name="field" type="xs:string" dfdl:lengthKind="endOfParent" dfdl:encoding="{$dfdl:encoding}" />
</xs:sequence>
</xs:complexType>
</xs:element>| schemaDefinitionWhen( | ||
| hasTerminator, | ||
| "%s is specified as dfdl:lengthKind=\"endOfParent\", but specifies a dfdl:terminator.", | ||
| context |
There was a problem hiding this comment.
I don't think we need to include context in the error string. I believe the error context is capture and output as part of the SDE.
| case LengthKind.EndOfParent => LengthMultipleOf(1) // NYI | ||
| case LengthKind.EndOfParent => | ||
| eb.immediatelyEnclosingElementParent match { | ||
| case Some(parent) => parent.elementSpecifiedLengthApprox |
There was a problem hiding this comment.
I don't think this is quite right. The length of this element isn't the same as the parent length, it's whatever is left over of the parent after the previous siblings.
So this elements length is kindof parent.elementSpecifiedLenghtApprox - priorAlignmentApprox (i.e the length of the parent minus wherever we are starting) but we can't just subtract approx things since they are potentially multiples.
That said, I wonder if we don't really need to get this elements approx length perfect, because no elements come after it, and the endingAlignApprox of the parent won't need this specific is going to be known since it has an explicit length? Maybe this just becomes LengthMultipleOf 1 or 8 depending on length units? This might need some more thought...
There was a problem hiding this comment.
Hmm Length Units don't apply for endOfParent, so we might be fine leaving it as LengthMultipleOf(1)
There was a problem hiding this comment.
But then if its parent has siblings, does that make an impact on us not approximating?
There was a problem hiding this comment.
Yeah, I think you're right that we can just say length approx of an EOP element is multiple of 1.
In the case of a parent of an EOP element having siblings, I think the parent will calculate contentEndAlignment to figure out where it ends to figure out what alignment is needed for its sibling. And we have this code that handles that:
case eb: ElementBase => {
val res = if (eb.isComplexType && eb.lengthKind == LengthKind.Implicit) {
eb.complexType.group.contentEndAlignment
} else {
// simple type or complex type with specified length
contentStartAlignment + elementSpecifiedLengthApprox
}
res
}I don't think implicit is allowed to a be a parent of an EOP child, so we would fall into the contentStartAlignment + elementSpecifiedLengthApprox branch.
And I think elementSpeciifedLengthApprox will calculate the length of the parent and know where it ends to figure out alignment for the sibling--it doesn't even care where the children ended. So when we have EOP, we kindof don't really care about the approx length of the child, because the paren't already knows what the length will be, so having the EOP child being LengthMultipleOf(1) doesn't really matter for the parent.
So I think we are okay and will correctly detect any needed alignment?
- Add checkEndOfParentRestriction to choice/sequence/elements - employ top-down approach looking from parents to their children instead of lexical child to parent approach - address checks for root elements - add support for EOP for packed decimals - add support for bytesTillEndOfDataStream for BBIS/BIS - update tests
- consolidate checkEndOfParentRestrictions in TermGrammarMixin - add more tests - clean up commented code
stevedlawrence
left a comment
There was a problem hiding this comment.
Looks really nice, lots of little comments or suggestions. I think my main question surrounds how we check there are nothing following EOP elements, even as siblings or ancestor siblings.
|
|
||
| requiredEvaluationsAlways(root) | ||
| requiredEvaluationsAlways(checkForDuplicateTopLevels()) | ||
| requiredEvaluationsAlways(root.checkEndOfParentRestrictions(Some(LengthUnits.Characters))) |
There was a problem hiding this comment.
Is it possible to put this on the Root class? It feels like that' the right place for this check and any checks that algorithmically need to start from the root like this one.
| .getOrElse(false) | ||
| } | ||
|
|
||
| final lazy val immediatelyEnclosingParentForEOPElem: Option[Term] = { |
There was a problem hiding this comment.
Can this be removed? It doesn't look like it's used anywhere, and it is has subtle edge cases where it doesn't necessarily return what one might think it would because it only knows about lexical parents.
| case eb: ElementBase => IndexedSeq(eb) | ||
| case mg: ModelGroup if mg.groupMembers.isEmpty => IndexedSeq(mg) | ||
| case s: SequenceTermBase => s.flattenedChildren | ||
| case c: ChoiceTermBase => c.flattenedChildren |
There was a problem hiding this comment.
I'm not sure this is correct for ChoiceTermBase, at least in the context where this is used to detect if there are siblings after this. For example, say we have a schema below, where the first branch of a choice contains an EOP element:
<element name="foo">
<complexType>
<choice>
<element name="bar" dfdl:lengthKind="endOfParent" ... />
<element name="baz" dfdl:lenghtKind="explicit" ... />
</choice>
</complexType>
</element>In this case, I think the choices flattenedChildren will look like:
IndexedSeq(bar, baz)
And so it will look like baz follows an EOP element and would SDE. But I think this this schema is legal?
Since choiceLengthKind doesn't allow EOP, maybe we just return Nil here, similar to realElementChildren? And maybe choices need to run their checks on individual branches? I'm not sure we can just flatten all branches of a choice into a single list of children.
Or maybe we need a different way to determine if there are things after an EOP element that doesn't require flattening. Maybe checkEndOfParentRestrictions returns a boolean that indicates if it saw an EOP element, and then as we roll up from the recursion we check the result? For xample, maybe something like:
def checkEndOfParentRestrcitions(...) {
...
val sawEOP = term.termChildren.foldLeft(false) { case (child, sawEOP)) =>
if (sawEOP) {
// the previous child saw an EOP element, and this child is after it, so error
SDE(...)
}
child.checkEndOfParentRestrictions(...)
}
sawEOP
}This doesn't handle non-represented elements or choice branches, but maybe something along those lines? This is also nice in that it doesn't require allocating any new lists like the flattenedChildren. We're already having to walking the entire tree, so if we can do logic just be return values or something that would be preferred. It's not the standard way we do things, but this parent/sibling logic this reuires kindof makes the normal way difficult.
| } | ||
| } | ||
|
|
||
| lazy val flattenedChildren: IndexedSeq[Term] = termChildren.flatMap { c => |
There was a problem hiding this comment.
Does an IndexedSeq provide much benefit with how this is used? I don't think we will usually know what index we an element is, so we'll likely have to do a linear scans regardless. And the way this is used it looks like we do indexOf and then get everything after that index. If this were a normal Seq, I think similar logic could be implemented like this:
val elementPlusNextSiblings = flattendChildren.dropWhile(_ != specificEopChild)
Assert.invariant(!elementPlusNextSibligns.isEmpty)
val nextSiblings = elementPlusNextSiblings.tail
if (nextSiblings.isDefined) {
// SDE, children exist afterEOPelement
}We don't really use the indexability, so this might avoid some overheads.
| } | ||
| } | ||
|
|
||
| def checkChildrenForSiblingsAfterEOPElement(specificChild: ElementBase) = { |
There was a problem hiding this comment.
One thing worth considering is this line in the spec:
It is a Schema Definition Error if:
...
- any other represented element is defined between this element and the end of the enclosing component.
I think the implication it is that it is legal to have non-represented elements (i.e. IVC's) that appear after a an EOP element. So for example, this is legal:
<element name="foo" dfdl:lengthKind="explicit" ...>
<complexType>
<sequence>
<element name="bar" dfdl:lengthKind="endOfParent" ... />
<element name="baz" dfdl:inputValueCalc="{ ... }" ... />
</sequence>
</complexType>
</element>Maybe this can be done by just excluding IVC elements form flattenedChildren, since we essentially ignore them in the context of checking for siblings after EOP?
| position(initialBytePosition) | ||
| throw new Exception( | ||
| s"Attempted to fill to end of data stream, but did not reach end of data stream before maxCacheSizeInBytes: ${maxCacheSizeInBytes}." | ||
| ) |
There was a problem hiding this comment.
I'm wondering if we should return a Option/Maybe here instead of throwing an exception. That way we can let callers decide how to handle the cases where an implementation couldn't find EOD. I imagine in most cases they'll want to create an SDE, but maybe there are other options. And we can just document that the function might return None if the end of data stream couldn't be found.
| val br = byteBuffer.remaining | ||
| position(initialPosition) | ||
| br | ||
| } |
There was a problem hiding this comment.
byterBuffer.remaining doesn't change the position so we don't need to capture and restore it.
Note that if we switch to endOfDataPosition, that could be calculated when the class is initialized, and it would just be something like override val endOfDataPostion = bb.remaining()
| */ | ||
| def hasReachedEndOfData: Boolean = inputSource.hasReachedEndOfData | ||
|
|
||
| lazy val bytesTillEndOfDataStream: Long = inputSource.bytesTillEndOfDataStream |
There was a problem hiding this comment.
For consistency, I'd suggest we make this a def. This is essentially a delegate to the underlying implementation, and we can let them decide best how the result show be stored or recalcuated.
| len | ||
| } else if (isEndOfParent) { | ||
| val byteLimit = pstate.dataInputStream.bytesTillEndOfDataStream | ||
| byteLimit * 8 |
There was a problem hiding this comment.
Is it possible for our current bitPosition to not on a byte boundary? Or do all EOP types required byte alignment?
I'm not sure if packed decimals can by half-byte aligned or not (I think they can?), but I'm pretty xs:hexBinary does not require a specific alignment. For example:
<element name="root" dfdl:lengthKind="endOfParent">
<complexType>
<sequene>
<element name="foo" type="xs:int" dfdl:lengthKind="4" dfdl:lengthUnits="bits" ... />
<element name="leftOver" type="xs:hexBinary" dfdl:lengthKind="endOfParent" ... />
</sequence>
</complexType>
</element>This feels like a reasonable way to capture leftOver data, even if we aren't byte aligned.
I think this is maybe another reason in favor for switching to something like endOfDataPosition, then this becomes something like
val dis = pstate.dataInputStream
val len = (dis.endOfDataPosition * 8) - dis.curBitPos0bWhich works as expected even if curBitPosition is not byte aligned.
|
|
||
| override protected def getBitLength(s: PState): MaybeULong = { | ||
| MaybeULong(super[BitLengthFromBitLimitMixin].getBitLength(s)) | ||
| } |
There was a problem hiding this comment.
I'm wondering if we need something different for SpecifiedLengthEndOfParentParser, at least the context of complex types. If we look at SpecifiedLengthParserBase, then logic it follows for complex types is essentially
- getBitLength
- ignored checking isDefinedForLength
- withBitLengthLimit { children.parse() }
- skip any remaining bits not consumed by children
Note that the reason for step 2 to ignore isDefinedForLength is because isDefinedForLength will cause the underlying input source too fill a bunch of buckets for the full length of the complex types, but there are limits to how much we can bucket.
Instead, we just call all the children, have them slowly read all the bytes as needed, and then skip whatever is left over at the end. So we don't require the full length of the complex type until we finish parsing the children.
But say we have an EOP complex root element. The first thing this does is call getBiLength which will to scan the entire stream for the EOD, which does exactly what we were trying to avoid by skipping isDefinedForLength in step 2.
So it feels like we need special logic for complex EOP elements so that they do something different if there is no current bit limit. For example, the logic wants to be more like this for complex EOP elements:
if (bitLimit.isDefined) {
val len = getBitLength // will use bitLimit
withBitLengthLength { children.parse }
val reminaingBits = ...
dis.skip(remaiingBits)
} else {
children.parse()
val len = getBitLength // will scan for EOD
val reminaingBits = ...
dis.skip(remainingBits)
}So we kindof need to reverse the order of getBitLengthand children.parse.
I'm not sure if it makes sense to change SpecifiedLengthParserBase to know about EOP and check for bitLimit or SpecifiedLengthEndOfParentParser wants to be it's own thing, maybe with helper functions that the Base parser has for the common things like skipping remaining bits.
Note this is only an issue for complex types, simple types need the length no matter what.
dfdl:lengthKind="endOfParent".endOfParentlengthKind.DAFFODIL-238