@yruslan - Thanks so much for your help on #147. We were able to re-export the data to include the RDW. However, we're still facing some issues.
Background
I'm reading a single file with two records that use the same copybook. However, when I try to save the DataFrame to JSON, I see four records, and the sections of the JSON that should contain repeating values (i.e. the parts of the copybook that use OCCURS...DEPENDING ON) are empty.
The relevant sections of the copybook are here.
02  FI-IP-SNF-CLM-REC.
    04  FI-IP-SNF-CLM-FIX-GRP.
        06  CLM-REC-IDENT-GRP.
            08  REC-LNGTH-CNT          PIC S9(5) COMP-3.
        ...
        06  IP-REV-CNTR-CD-I-CNT       PIC 99.
        ...
        06  CLM-REV-CNTR-GRP           OCCURS 0 TO 45 TIMES
            DEPENDING ON IP-REV-CNTR-CD-I-CNT
            OF FI-IP-SNF-CLM-REC.
        ...
Cobrix logs the following.
-------- FIELD LEVEL/NAME --------- --ATTRIBS--  FLD  START    END  LENGTH

FI_IP_SNF_CLM_REC                                         1  31656   31656
  4 FI_IP_SNF_CLM_FIX_GRP                         244     1   2058    2058
    6 CLM_REC_IDENT_GRP                             7     1      8       8
      8 REC_LNGTH_CNT                               3     1      3       3
      ...
    6 IP_REV_CNTR_CD_I_CNT            D           153  1249   1250       2
      ...
    6 CLM_REV_CNTR_GRP                []          360  4384  31653   27270
      ...
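From that layout, each CLM_REV_CNTR_GRP element is 27270 / 45 = 606 bytes, so I can estimate what a record's length should be for a given counter value. Here's a small sanity-check sketch (sizes taken from the log above; the 3 trailing bytes between END 31653 and the 31656 total are just my reading of the log):

// Sizes derived from the Cobrix layout log above.
val fixedPrefixBytes   = 4383          // bytes before CLM_REV_CNTR_GRP (it starts at 4384)
val occursElementBytes = 27270 / 45    // 606 bytes per CLM_REV_CNTR_GRP element
val trailingBytes      = 31656 - 31653 // 3 bytes after the OCCURS group

// Expected record length for a given IP_REV_CNTR_CD_I_CNT value.
def expectedLength(revCntrCount: Int): Int =
  fixedPrefixBytes + revCntrCount * occursElementBytes + trailingBytes

println(expectedLength(23)) // e.g. 18324 -- something to compare against REC_LNGTH_CNT and the RDW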
Here's my code:
val inpk_df = spark
  .read
  .format("cobol")
  .option("copybook", "data/UTLIPSNK.txt")
  .option("generate_record_id", "true")
  .option("is_record_sequence", "true")
  .option("is_rdw_big_endian", "true")
  .load("data/in/file1")

inpk_df.write.json("data/out/file1")
This produces JSON that looks like this.
{
"File_Id": 0,
"Record_Id": 0,
"FI_IP_SNF_CLM_REC": {...}
}
{
"File_Id": 0,
"Record_Id": 1,
"FI_IP_SNF_CLM_REC": {...}
}
{
"File_Id": 0,
"Record_Id": 2,
"FI_IP_SNF_CLM_REC": {...}
}
{
"File_Id": 0,
"Record_Id": 3,
"FI_IP_SNF_CLM_REC": {...}
}
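As a quick sanity check (to rule out the JSON writer doing something odd), the row count can be inspected directly:

inpk_df.count() // the file should contain 2; the JSON above suggests this returns 4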
First Question
So, the first question is: why is it creating four records and not two? If I omit the .option("is_rdw_big_endian", "true") line, then I see this error instead.
java.lang.IllegalStateException: RDW headers should never be zero (64,7,0,0). Found zero size record at 4.
at za.co.absa.cobrix.cobol.parser.headerparsers.RecordHeaderParserRDW.processRdwHeader(RecordHeaderParserRDW.scala:82)
...
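One thing I noticed: the header in that error, (64,7,0,0), reads as 64 * 256 + 7 = 16391 when taken big-endian, which is exactly the first record's REC_LNGTH_CNT (16,387) plus the 4 header bytes, so the big-endian option does look right. To see all the headers, here's a minimal sketch for walking the file RDW by RDW (plain JVM I/O, not a Cobrix API; it assumes the RDW length includes the 4 header bytes, which may need adjusting):

import java.io.{DataInputStream, FileInputStream}

// Walk the file header by header, printing each 4-byte RDW both ways so the
// big-endian vs little-endian interpretations are easy to compare.
val in = new DataInputStream(new FileInputStream("data/in/file1"))
try {
  var offset = 0L
  var keepGoing = true
  while (keepGoing && in.available() >= 4) {
    val h = new Array[Byte](4)
    in.readFully(h)
    val b = h.map(_ & 0xFF)
    val bigEndian    = (b(0) << 8) | b(1) // length from the first two bytes
    val littleEndian = (b(2) << 8) | b(3) // length from the last two bytes
    println(f"offset=$offset%10d header=(${b(0)},${b(1)},${b(2)},${b(3)}) BE=$bigEndian LE=$littleEndian")
    // Assumption: the length includes the 4 RDW bytes, so skip (length - 4).
    if (bigEndian <= 4) keepGoing = false
    else { in.skipBytes(bigEndian - 4); offset += bigEndian }
  }
} finally in.close()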
Now, the REC_LNGTH_CNT field should contain the actual record length. Its value for the two records is 16,387 and 13,950, respectively. I tried to use that field rather than the RDW, as follows.
...
// .option("is_record_sequence", "true")
// .option("is_rdw_big_endian", "true")
.option("record_length_field", "FI-IP-SNF-CLM-REC.FI-IP-SNF-CLM-FIX-GRP.CLM-REC-IDENT-GRP.REC-LNGTH-CNT")
...
But I got this error.
java.lang.IllegalStateException: Record length value of the field REC_LNGTH_CNT must be an integral type.
at za.co.absa.cobrix.spark.cobol.reader.varlen.iterator.VarLenNestedIterator.fetchRecordUsingRecordLengthField(VarLenNestedIterator.scala:143)
...
Is that because this field is defined as PIC S9(5) COMP-3 in the copybook?
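(For reference: COMP-3 packs two decimal digits per byte with a sign nibble at the end, so PIC S9(5) COMP-3 occupies 3 bytes, matching LENGTH 3 in the layout log. Here's a little decoder I used to check REC_LNGTH_CNT values by hand; it's my own helper, not part of Cobrix.)

// Decode an IBM packed-decimal (COMP-3) field to a Long.
// Each byte holds two BCD digits; the low nibble of the last byte is the sign
// (0xC/0xF = positive, 0xD = negative).
def decodeComp3(bytes: Array[Byte]): Long = {
  val digits = bytes.flatMap(b => Seq((b >> 4) & 0x0F, b & 0x0F))
  val sign = if (digits.last == 0x0D) -1L else 1L
  sign * digits.dropRight(1).foldLeft(0L)((acc, d) => acc * 10 + d)
}

// Example: 0x16 0x38 0x7C decodes to +16387, the first record's REC_LNGTH_CNT.
println(decodeComp3(Array(0x16, 0x38, 0x7C).map(_.toByte))) // 16387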
I'm guessing there is a mismatch between what the RDW is indicating and the actual data. Do you have some pointers for troubleshooting that and working around it?
Second Question
The second question is, how come the nested JSON array isn't populated for the variable length field values?
The value of the IP-REV-CNTR-CD-I-CNT field in the JSON for the first record looks like this:
...
"IP_REV_CNTR_CD_I_CNT": 23,
...
So, I expect 23 elements to be populated. The value of the "CLM_REV_CNTR_GRP" key is indeed an array of 23 elements, but they are all empty: the first 20 are objects where every key has an empty value, and the last three are just empty objects.
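For reference, this is how I inspected the array contents (plain Spark; the field path follows the layout log above):

import org.apache.spark.sql.functions.{col, explode}

// One row per CLM_REV_CNTR_GRP element, to see what the "empty" elements contain.
inpk_df
  .select(explode(col("FI_IP_SNF_CLM_REC.FI_IP_SNF_CLM_FIX_GRP.CLM_REV_CNTR_GRP")).as("rev_cntr"))
  .show(25, truncate = false)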
Any ideas?
Thanks so much for your help!!!