Reading Variable Length File with OCCURS DEPENDING #666

@pinakigit

Description


Hi,

We are sending files from the mainframe to ADLS through FTP in binary mode. We read the binary data with Cobrix and create parquet files out of it. The FB files work like a charm.

spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").load("/data/example1/data")

We can also read VB files with an OCCURS clause:

spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").option("variable_size_occurs", "true").load("/data/example1/data")

Question

Below is a sample copybook for a variable-length file. It has two record formats: if the value of the field PKLR1-PAR-PEN-REG-CODE is 'N', the record only has values up to the field PKLR1-VALUATION-CODE; if it is 'Y', it maps to the entire copybook, including the OCCURS DEPENDING ON clause.

01  PKLR1-DETAIL-LOAN-RECORD.
    10  PKLR1-BASIC-SECTION.
        20  PKLR1-SORT-CONTROL-FIELD.
            30  PKLR1-USER-IDENT          PIC X(1).
            30  PKLR1-EXTRACT-CODE        PIC X(1).
                88  PKLR1-DATA-RECORD     VALUE '0'.
                88  PKLR1-END-OF-FILE     VALUE '9'.
            30  PKLR1-SECTION             PIC X(1).
            30  PKLR1-TYPE                PIC X(1).
            30  PKLR1-NUMERIC-STATE-CODE  PIC X(2).
            30  PKLR1-CONTRACT-NUMBER     PIC X(10).
        20  PKLR1-PAR-PEN-REG-CODE        PIC X(1).
            88  PKLR1-PAR                 VALUE 'Y'.
            88  PKLR1-NAPR                VALUE 'N'.
        20  PKLR1-VALUATION-CODE.
            30  PKLR1-MORTALITY-TABLE     PIC X(2).
            30  PKLR1-LIVES-CODE          PIC X(1).
            30  PKLR1-FUNCTION            PIC X(1).
            30  PKLR1-VAL-INTEREST        PIC S9(2)V9(3) COMP-3.
            30  PKLR1-MODIFICATION        PIC X(1).
            30  PKLR1-INSURANCE-CLASS     PIC X(1).
            30  PKLR1-SERIES              PIC X(5).
        20  PKLR1-POLICY-STATUS           PIC X(2).
        20  PKLR1-PAR-CODES.
            30  PKLR1-PAR-TYPE            PIC X(1).
            30  PKLR1-DIVIDEND-OPTION     PIC X(1).
            30  PKLR1-OTHER-OPTION        PIC X(1).
        20  PKLR1-ALPHA-STATE-CODE        PIC X(2).
        20  PKLR1-OUT-LOC-DTLS OCCURS 1 TO 5 TIMES
            DEPENDING ON PKLR1-OUT-NO-OF-LOC.
            30  PKLR1-OUT-LOC             PIC X(10).
            30  PKLR1-OUT-LOC-QTY         PIC S9(9) COMP-3.

Query 1:
How can I read this file? I tried the things below and nothing seems to work.

  1. spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").option("variable_size_occurs", "true").load("/data/example1/data") - doesn't pull any records.
  2. spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").option("record_format", "V").option("variable_size_occurs", "true").load("/data/example1/data") - doesn't pull any records.
  3. spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").option("PKLR1-PAR-PEN-REG-CODE", "Y").option("variable_size_occurs", "true").load("/data/example1/data") - pulls records properly until it encounters a record with PKLR1-PAR-PEN-REG-CODE = 'N', and it doesn't pull any records after that.
  4. Created another copybook, test4_copybook, which only has fields up to PKLR1-SERIES:
    spark.read.format("cobol").option("copybook", "/data/example1/test4_copybook.cob").option("PKLR1-PAR-PEN-REG-CODE", "Y").option("variable_size_occurs", "true").load("/data/example1/data") - pulls records properly until it encounters a record with PKLR1-PAR-PEN-REG-CODE = 'Y', and it doesn't pull any records after that.

How can I read this file and create a parquet file out of it?
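
For reference, below is a minimal sketch of the end-to-end pipeline I am aiming for. It is attempt 2 from the list above plus the parquet write (the output path is only illustrative); this is the combination that currently returns no records:

// Read the variable-length file with Cobrix and write it out as parquet.
val df = spark.read
  .format("cobol")
  .option("copybook", "/data/example1/test3_copybook.cob")
  .option("record_format", "V")             // "V" expects an RDW on each record
  .option("variable_size_occurs", "true")   // shrink the OCCURS DEPENDING ON group per record
  .load("/data/example1/data")

df.write.mode("overwrite").parquet("/data/example1/output_parquet")  // illustrative output path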

Query 2:
When writing the parquet file, the field is created as an array of structs. Is there a way to flatten it, i.e. always create 5 occurrences of the fields PKLR1-OUT-LOC and PKLR1-OUT-LOC-QTY (PKLR1-OUT-LOC1, PKLR1-OUT-LOC-QTY1, PKLR1-OUT-LOC2, PKLR1-OUT-LOC-QTY2, PKLR1-OUT-LOC3, PKLR1-OUT-LOC-QTY3, PKLR1-OUT-LOC4, PKLR1-OUT-LOC-QTY4, PKLR1-OUT-LOC5, PKLR1-OUT-LOC-QTY5), and depending on PKLR1-OUT-NO-OF-LOC these fields are either populated or set to NULL?
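
To make Query 2 concrete, here is the kind of flattening I have in mind, written with plain Spark column functions. It assumes the group is reachable as a top-level array column named PKLR1_OUT_LOC_DTLS (hyphens replaced by underscores); if the root group is kept nested, the nested path would have to be used instead. Out-of-range indices come back as NULL, which is the behaviour I want:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Expand the OCCURS array into fixed columns PKLR1_OUT_LOC1..5 and PKLR1_OUT_LOC_QTY1..5.
// getItem(i) on a missing array index returns NULL, so records with fewer occurrences
// simply get NULL in the trailing columns.
def flattenOccurs(df: DataFrame): DataFrame = {
  (0 until 5).foldLeft(df) { (acc, i) =>
    acc
      .withColumn(s"PKLR1_OUT_LOC${i + 1}",
        col("PKLR1_OUT_LOC_DTLS").getItem(i).getField("PKLR1_OUT_LOC"))
      .withColumn(s"PKLR1_OUT_LOC_QTY${i + 1}",
        col("PKLR1_OUT_LOC_DTLS").getItem(i).getField("PKLR1_OUT_LOC_QTY"))
  }.drop("PKLR1_OUT_LOC_DTLS")
}

val flatDf = flattenOccurs(df)

If I read the Cobrix docs correctly there is also a built-in schema flattening helper (SparkUtils.flattenSchema), but I have not tried it for this case.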

Query 3:
When I receive the file in ADLS, how do I tell whether it is coming as VB or FB? I tried the VB header examples (with both BDW and RDW headers) and it throws an error saying the BDW headers have non-zero values.
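
For context, this is roughly how I am peeking at the first bytes of the landed file to guess whether it still carries BDW/RDW headers (the file name is just an example):

import org.apache.hadoop.fs.Path

// A big-endian BDW is 4 bytes (2-byte block length followed by two zero bytes) and is
// immediately followed by a 4-byte RDW; an FB transfer starts directly with EBCDIC data,
// so the first bytes would just be part of the first record.
val path = new Path("/data/example1/data/part0001.dat")   // example file name
val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val in = fs.open(path)
val header = new Array[Byte](8)
in.readFully(header)
in.close()
println(header.map(b => f"${b & 0xff}%02X").mkString(" "))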

Labels

accepted (Accepted for implementation), enhancement (New feature or request), question (Further information is requested)
