
Implement variable length arrays (OCCURS) #156

@jpdev-sb

@yruslan - Thanks so much for your help on #147 . We were able to re-export the data to include the RDW. However, we're still facing some issues.

Background

I'm reading a single file with two records that use the same copybook. However, when I try to save the dataframe to JSON, I see four records, and the sections of the JSON that should contain repeating values (i.e. the sections of the copybook that use OCCURS...DEPENDING ON) are empty.

The relevant sections of the copybook are here.

           02 FI-IP-SNF-CLM-REC.
             04 FI-IP-SNF-CLM-FIX-GRP.
               06 CLM-REC-IDENT-GRP.
                 08 REC-LNGTH-CNT          PIC S9(5) COMP-3.
           ...
               06 IP-REV-CNTR-CD-I-CNT     PIC 99.
           ...
               06 CLM-REV-CNTR-GRP      OCCURS 0 TO 45 TIMES
                          DEPENDING ON IP-REV-CNTR-CD-I-CNT
                          OF FI-IP-SNF-CLM-REC.
           ...

Cobrix logs the following.

-------- FIELD LEVEL/NAME --------- --ATTRIBS--    FLD  START     END  LENGTH

FI_IP_SNF_CLM_REC                                            1  31656  31656
  4 FI_IP_SNF_CLM_FIX_GRP                           244      1   2058   2058
    6 CLM_REC_IDENT_GRP                               7      1      8      8
      8 REC_LNGTH_CNT                                 3      1      3      3
  ...
    6 IP_REV_CNTR_CD_I_CNT             D            153   1249   1250      2
  ...
    6 CLM_REV_CNTR_GRP                 []           360   4384  31653  27270
  ...
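As a sanity check on sizes, here's some back-of-the-envelope math using the numbers from the dump above (assuming CLM_REV_CNTR_GRP is the only variable-length group in the record):

    // Sizes taken from the layout dump: the OCCURS group spans 27270 bytes
    // for at most 45 occurrences, i.e. 606 bytes per element.
    val elementSize = 27270 / 45          // 606 bytes per CLM_REV_CNTR_GRP element
    val fixedSize   = 31656 - 27270       // 4386 bytes outside the OCCURS group
    // So a record with IP_REV_CNTR_CD_I_CNT = n should occupy roughly:
    def expectedLength(n: Int): Int = fixedSize + n * elementSize
    expectedLength(23)                    // 18324 bytes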

Here's my code:

    val inpk_df = spark
      .read
      .format("cobol")
      .option("copybook", "data/UTLIPSNK.txt")
      .option("generate_record_id", true)     // adds File_Id and Record_Id columns
      .option("is_record_sequence", "true")   // records are prefixed with RDW headers
      .option("is_rdw_big_endian", "true")    // RDW length bytes are big-endian
      .load("data/in/file1")
    inpk_df.write.json("data/out/file1")

This produces JSON that looks like this.

{
  "File_Id": 0,
  "Record_Id": 0,
  "FI_IP_SNF_CLM_REC": {...}
}
{
  "File_Id": 0,
  "Record_Id": 1,
  "FI_IP_SNF_CLM_REC": {...}
}
{
  "File_Id": 0,
  "Record_Id": 2,
  "FI_IP_SNF_CLM_REC": {...}
}
{
  "File_Id": 0,
  "Record_Id": 3,
  "FI_IP_SNF_CLM_REC": {...}
}

First Question

So, the first question is: why is it creating four records and not two? If I omit .option("is_rdw_big_endian", "true"), I see this error.

java.lang.IllegalStateException: RDW headers should never be zero (64,7,0,0). Found zero size record at 4.
	at za.co.absa.cobrix.cobol.parser.headerparsers.RecordHeaderParserRDW.processRdwHeader(RecordHeaderParserRDW.scala:82)
...
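To see what the headers actually contain, here's a minimal sketch that walks the raw file, assuming 4-byte RDW headers with a big-endian length in the first two bytes and zeros in the last two (if the RDW length turns out to include the header itself, the 4 + len stride below needs adjusting):

    import java.nio.file.{Files, Paths}

    // Walks the file record by record and prints each RDW-declared length.
    def dumpRdwHeaders(path: String): Unit = {
      val bytes  = Files.readAllBytes(Paths.get(path))
      var offset = 0
      var recNum = 0
      while (offset + 4 <= bytes.length) {
        val len = ((bytes(offset) & 0xFF) << 8) | (bytes(offset + 1) & 0xFF)
        println(s"record $recNum at offset $offset: rdw length = $len")
        if (len == 0) return           // a zero length means we are misaligned
        offset += 4 + len              // skip the header plus the payload
        recNum += 1
      }
    }

    dumpRdwHeaders("data/in/file1")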

Now, the REC_LNGTH_CNT field should contain the actual record length. Its value for the two records is 16,387 and 13,950, respectively. I tried to use it rather than the RDW, as follows.

...
//      .option("is_record_sequence", "true")
//      .option("is_rdw_big_endian", "true")
      .option("record_length_field", "FI-IP-SNF-CLM-REC.FI-IP-SNF-CLM-FIX-GRP.CLM-REC-IDENT-GRP.REC-LNGTH-CNT")
...

But, I got this error.

java.lang.IllegalStateException: Record length value of the field REC_LNGTH_CNT must be an integral type.
	at za.co.absa.cobrix.spark.cobol.reader.varlen.iterator.VarLenNestedIterator.fetchRecordUsingRecordLengthField(VarLenNestedIterator.scala:143)
...

Is that because this field is defined as PIC S9(5) COMP-3 in the copybook?

I'm guessing there is a mismatch between what the RDW is indicating and the actual data. Do you have some pointers for troubleshooting that and working around it?
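For cross-checking REC_LNGTH_CNT against the RDW values by hand, here's a minimal sketch of unpacking a COMP-3 value (PIC S9(5) COMP-3 occupies 3 bytes: five digit nibbles plus a trailing sign nibble). Per the layout dump, the field is the first 3 bytes of the record payload:

    // Unpacks a packed-decimal (COMP-3) field of `fieldLen` bytes starting at
    // `offset`. Each byte holds two decimal nibbles; the final nibble is the
    // sign (0xD = negative, 0xC/0xF = positive).
    def unpackComp3(bytes: Array[Byte], offset: Int, fieldLen: Int): Long = {
      var value = 0L
      for (i <- 0 until fieldLen) {
        val b = bytes(offset + i) & 0xFF
        value = value * 10 + (b >> 4)
        if (i < fieldLen - 1) value = value * 10 + (b & 0x0F)
      }
      if ((bytes(offset + fieldLen - 1) & 0x0F) == 0x0D) -value else value
    }

    // With recordBytes holding one record's payload (after the 4-byte RDW):
    // unpackComp3(recordBytes, 0, 3)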

Second Question

The second question is: why isn't the nested JSON array populated for the variable-length field values?

The value of the IP-REV-CNTR-CD-I-CNT field in the JSON for the first record looks like this:

...
"IP_REV_CNTR_CD_I_CNT": 23,
...

So, I expect 23 elements to be populated. The value of the "CLM_REV_CNTR_GRP" key is indeed an array of 23 elements, but they are all empty: the first 20 are objects where every key has an empty value, and the last three are just empty objects.
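For reference, here's a quick way to inspect what was parsed into the OCCURS group. The column path below is illustrative, since the intermediate groups are elided in the copybook excerpt above; inpk_df.printSchema() gives the exact nesting:

    import org.apache.spark.sql.functions.{col, size}

    // Count how many elements were parsed into the OCCURS group per record.
    // NOTE: take the real column path from inpk_df.printSchema().
    inpk_df
      .select(size(col("FI_IP_SNF_CLM_REC.CLM_REV_CNTR_GRP")).as("n_elements"))
      .show()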

Any ideas?

Thanks so much for your help!!!
