Reading Variable Length File with OCCURS DEPENDING #666

@pinakigit

Description


Hi,

We are sending files from the mainframe to ADLS through FTP in binary mode. We read the binary data with Cobrix and create parquet files out of it. The FB files work like a charm.

spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").load("/data/example1/data")

We can also read VB files with an OCCURS clause:

spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").option("variable_size_occurs", "true").load("/data/example1/data")

Question

Below is a sample copybook for a variable-length file. It has two record formats: if the value of the field PKLR1-PAR-PEN-REG-CODE is 'N', the record only has values up to the field PKLR1-VALUATION-CODE; if it is 'Y', it maps to the entire copybook, including the OCCURS DEPENDING ON clause.

01  PKLR1-DETAIL-LOAN-RECORD.
    10  PKLR1-BASIC-SECTION.
        20  PKLR1-SORT-CONTROL-FIELD.
            30  PKLR1-USER-IDENT          PIC X(1).
            30  PKLR1-EXTRACT-CODE        PIC X(1).
                88  PKLR1-DATA-RECORD     VALUE '0'.
                88  PKLR1-END-OF-FILE     VALUE '9'.
            30  PKLR1-SECTION             PIC X(1).
            30  PKLR1-TYPE                PIC X(1).
            30  PKLR1-NUMERIC-STATE-CODE  PIC X(2).
            30  PKLR1-CONTRACT-NUMBER     PIC X(10).
        20  PKLR1-PAR-PEN-REG-CODE        PIC X(1).
            88  PKLR1-PAR                 VALUE 'Y'.
            88  PKLR1-NAPR                VALUE 'N'.
        20  PKLR1-VALUATION-CODE.
            30  PKLR1-MORTALITY-TABLE     PIC X(2).
            30  PKLR1-LIVES-CODE          PIC X(1).
            30  PKLR1-FUNCTION            PIC X(1).
            30  PKLR1-VAL-INTEREST        PIC S9(2)V9(3) COMP-3.
            30  PKLR1-MODIFICATION        PIC X(1).
            30  PKLR1-INSURANCE-CLASS     PIC X(1).
            30  PKLR1-SERIES              PIC X(5).
        20  PKLR1-POLICY-STATUS           PIC X(2).
        20  PKLR1-PAR-CODES.
            30  PKLR1-PAR-TYPE            PIC X(1).
            30  PKLR1-DIVIDEND-OPTION     PIC X(1).
            30  PKLR1-OTHER-OPTION        PIC X(1).
        20  PKLR1-ALPHA-STATE-CODE        PIC X(2).
        20  PKLR1-OUT-LOC-DTLS OCCURS 1 TO 5 TIMES
            DEPENDING ON PKLR1-OUT-NO-OF-LOC.
            30  PKLR1-OUT-LOC             PIC X(10).
            30  PKLR1-OUT-LOC-QTY         PIC S9(9) COMP-3.

Query 1:
How can I read this file? I tried the things below and nothing seems to work.

  1. spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").option("variable_size_occurs", "true").load("/data/example1/data") - doesn't pull any records.
  2. spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").option("record_format", "V").option("variable_size_occurs", "true").load("/data/example1/data") - doesn't pull any records.
  3. spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").option("PKLR1-PAR-PEN-REG-CODE", "Y").option("variable_size_occurs", "true").load("/data/example1/data") - pulls records properly until it encounters a record with PKLR1-PAR-PEN-REG-CODE = 'N', and it doesn't pull any records after that.
  4. Created another copybook, test4_copybook, which only has fields up to PKLR1-SERIES:
    spark.read.format("cobol").option("copybook", "/data/example1/test4_copybook.cob").option("PKLR1-PAR-PEN-REG-CODE", "Y").option("variable_size_occurs", "true").load("/data/example1/data") - pulls records properly until it encounters a record with PKLR1-PAR-PEN-REG-CODE = 'Y', and it doesn't pull any records after that.

How can I read this file and create a parquet file out of it?
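
For reference, below is a minimal sketch of the end-to-end pipeline I am aiming for. It is attempt 2 from the list above plus the parquet write (the output path is only illustrative); this is the combination that currently returns no records:

// Read the variable-length file with Cobrix and write it out as parquet.
val df = spark.read
  .format("cobol")
  .option("copybook", "/data/example1/test3_copybook.cob")
  .option("record_format", "V")             // "V" expects an RDW on each record
  .option("variable_size_occurs", "true")   // shrink the OCCURS DEPENDING ON group per record
  .load("/data/example1/data")

df.write.mode("overwrite").parquet("/data/example1/output_parquet")  // illustrative output path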

Query 2:
When writing the parquet file, the field is created as an array of structs. Is there a way to flatten it, i.e. always create 5 occurrences of the fields PKLR1-OUT-LOC and PKLR1-OUT-LOC-QTY (PKLR1-OUT-LOC1, PKLR1-OUT-LOC-QTY1, PKLR1-OUT-LOC2, PKLR1-OUT-LOC-QTY2, PKLR1-OUT-LOC3, PKLR1-OUT-LOC-QTY3, PKLR1-OUT-LOC4, PKLR1-OUT-LOC-QTY4, PKLR1-OUT-LOC5, PKLR1-OUT-LOC-QTY5), and depending on PKLR1-OUT-NO-OF-LOC these fields are either populated or set to NULL?
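
To make Query 2 concrete, here is the kind of flattening I have in mind, written with plain Spark column functions. It assumes the group is reachable as a top-level array column named PKLR1_OUT_LOC_DTLS (hyphens replaced by underscores); if the root group is kept nested, the nested path would have to be used instead. Out-of-range indices come back as NULL, which is the behaviour I want:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Expand the OCCURS array into fixed columns PKLR1_OUT_LOC1..5 and PKLR1_OUT_LOC_QTY1..5.
// getItem(i) on a missing array index returns NULL, so records with fewer occurrences
// simply get NULL in the trailing columns.
def flattenOccurs(df: DataFrame): DataFrame = {
  (0 until 5).foldLeft(df) { (acc, i) =>
    acc
      .withColumn(s"PKLR1_OUT_LOC${i + 1}",
        col("PKLR1_OUT_LOC_DTLS").getItem(i).getField("PKLR1_OUT_LOC"))
      .withColumn(s"PKLR1_OUT_LOC_QTY${i + 1}",
        col("PKLR1_OUT_LOC_DTLS").getItem(i).getField("PKLR1_OUT_LOC_QTY"))
  }.drop("PKLR1_OUT_LOC_DTLS")
}

val flatDf = flattenOccurs(df)

If I read the Cobrix docs correctly there is also a built-in schema flattening helper (SparkUtils.flattenSchema), but I have not tried it for this case.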

Query 3:
When I receive the file in ADLS, how do I tell whether it is coming as VB or FB? I tried the VB header examples (with both BDW and RDW headers) and it throws an error saying the BDW headers have non-zero values.
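
For context, this is roughly how I am peeking at the first bytes of the landed file to guess whether it still carries BDW/RDW headers (the file name is just an example):

import org.apache.hadoop.fs.Path

// A big-endian BDW is 4 bytes (2-byte block length followed by two zero bytes) and is
// immediately followed by a 4-byte RDW; an FB transfer starts directly with EBCDIC data,
// so the first bytes would just be part of the first record.
val path = new Path("/data/example1/data/part0001.dat")   // example file name
val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val in = fs.open(path)
val header = new Array[Byte](8)
in.readFully(header)
in.close()
println(header.map(b => f"${b & 0xff}%02X").mkString(" "))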

Labels

accepted (Accepted for implementation), enhancement (New feature or request), question (Further information is requested)
