45

In PySpark you can define a schema and read data sources with this pre-defined schema, e.g.:

from pyspark.sql.types import StructType, StructField, DoubleType, StringType

Schema = StructType([ StructField("temperature", DoubleType(), True),
                      StructField("temperature_unit", StringType(), True),
                      StructField("humidity", DoubleType(), True),
                      StructField("humidity_unit", StringType(), True),
                      StructField("pressure", DoubleType(), True),
                      StructField("pressure_unit", StringType(), True)
                    ])

For some data sources it is possible to infer the schema from the data source and get a dataframe with this schema definition.

Is it possible to get the schema definition (in the form described above) from a dataframe whose schema has already been inferred?

df.printSchema() prints the schema as a tree, but I need to reuse the schema, defined as above, so that I can read one data source with a schema that was previously inferred from another data source.

5 Answers

58

Yes, it is possible. Use the DataFrame.schema property:

schema

Returns the schema of this DataFrame as a pyspark.sql.types.StructType.

>>> df.schema
StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))

New in version 1.3.

The schema can also be exported to JSON and imported back if needed.

12

The code below gives you the schema of a known dataframe as a list of StructField objects. This is quite useful when you have a very large number of columns and writing the schema by hand is cumbersome. You can hand-edit any fields you want to change and then apply the result to your new dataframe.

from pyspark.sql.types import StructType

schema = [i for i in df.schema] 

And then from here, you have your new schema:

NewSchema = StructType(schema)
12

If you are looking for a DDL string from PySpark:

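from pyspark.sql import DataFrame

df: DataFrame = spark.read.load('LOCATION')
schema_json = df.schema.json()
# Note: _jvm is a private attribute, so this may break between Spark versions.
ddl = spark.sparkContext._jvm.org.apache.spark.sql.types.DataType.fromJson(schema_json).toDDL()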
1
  • 1
    @user1119283: instead of df.schema.json() try with df.select('yourcolumn').schema.json() ?
    – anky
    Jun 8, 2022 at 17:30
9

You can reuse the schema of an existing DataFrame:

l = [('Ankita',25,'F'),('Jalfaizy',22,'M'),('saurabh',20,'M'),('Bala',26,None)]
people_rdd=spark.sparkContext.parallelize(l)
schemaPeople = people_rdd.toDF(['name','age','gender'])

schemaPeople.show()

+--------+---+------+
|    name|age|gender|
+--------+---+------+
|  Ankita| 25|     F|
|Jalfaizy| 22|     M|
| saurabh| 20|     M|
|    Bala| 26|  null|
+--------+---+------+

spark.createDataFrame(people_rdd,schemaPeople.schema).show()

+--------+---+------+
|    name|age|gender|
+--------+---+------+
|  Ankita| 25|     F|
|Jalfaizy| 22|     M|
| saurabh| 20|     M|
|    Bala| 26|  null|
+--------+---+------+

Just use df.schema to get the underlying schema of the dataframe:

schemaPeople.schema

StructType(List(StructField(name,StringType,true),StructField(age,LongType,true),StructField(gender,StringType,true)))
0

Since version 3.3.0, PySpark returns df.schema as a valid Python expression: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.schema.html#pyspark.sql.DataFrame.schema

>>> df.schema
StructType([StructField('age', IntegerType(), True),
            StructField('name', StringType(), True)])
