Data Generation

 

Datasets can be generated automatically from a schema so that queries can be tested during grading. The generated data is random, and foreign keys are respected.

Data Generation Use

Data generation does not function properly on schemas with a hyphen (“-”) in the schema name or in any table name.
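
A schema can be checked for this restriction before data generation is attempted. The following is a minimal sketch only; the function name and its arguments are hypothetical and are not part of the product.

```python
def names_with_hyphen(schema_name, table_names):
    """Return every schema or table name that contains a hyphen."""
    # Hypothetical helper: the caller supplies the names as plain strings.
    return [name for name in [schema_name, *table_names] if "-" in name]


# Example: one table name contains a hyphen and would need to be renamed.
problems = names_with_hyphen("school", ["student", "course-section"])
```

An empty result suggests the names are compatible with data generation as far as this restriction is concerned.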

Data Generation Implementation

Data generation is implemented with the Python Faker library. The generated values are written out as SQL INSERT statements that load the data into each table. Foreign keys and uniqueness constraints are detected to determine the order in which tables are populated and to ensure the generated data adheres to the schema requirements.
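
The ordering step can be illustrated with a topological sort over foreign key references. This is a sketch under stated assumptions, not the actual implementation: the function name, the dictionary format, and the example tables are hypothetical, and it relies on the standard library graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def insert_order(foreign_keys):
    """Order tables so that every referenced table comes before its referrers.

    foreign_keys maps each table name to the set of tables it references
    (a hypothetical input format chosen for this sketch).
    """
    sorter = TopologicalSorter()
    for table, referenced in foreign_keys.items():
        # Skip self-references, which would otherwise count as a cycle.
        sorter.add(table, *(r for r in referenced if r != table))
    # static_order() raises graphlib.CycleError if tables reference each other in a loop.
    return list(sorter.static_order())


# Hypothetical example schema.
schema = {
    "Department": set(),
    "Student": {"Department"},
    "Course": {"Department"},
    "Enrollment": {"Student", "Course"},
}
order = insert_order(schema)
```

Inserting rows in this order means a parent row already exists whenever a child row refers to it.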

For efficiency during grading, a single container is created and the data is inserted once; a snapshot is then taken to preserve the completed environment state. The snapshot is used as the container image for every student submission to be graded, eliminating the need to rerun the large INSERT statements on additional database instances.

Data Types Observed

The following data types are recognized for data generation:

  • string, nvarchar, varchar, nchar, char, text
  • uniqueidentifier
  • int, tinyint, smallint, bigint
  • float
  • decimal, numeric
  • money, currency
  • binary
  • bit
  • date, time, datetime, datetime2, datetimeoffset

Columns with data types not listed may be left blank.
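
The relationship between a column's data type and its generated value can be sketched as a lookup table with a fallback. This example is illustrative only: it uses Python's standard library as a stand-in for Faker so that it runs without extra dependencies, it covers only a subset of the types listed above, and the names GENERATORS and generate_value are hypothetical. Returning None for an unlisted type is an assumption about how a blank column arises.

```python
import random
import uuid
from datetime import date, timedelta

rng = random.Random(0)  # fixed seed so that runs are repeatable


def random_string(length=12):
    """Return a random lowercase string of the given length."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(rng.choice(letters) for _ in range(length))


def random_date():
    """Return a random date within roughly 25 years of 2000-01-01."""
    return date(2000, 1, 1) + timedelta(days=rng.randrange(0, 9000))


# Hypothetical mapping from type name to a value generator.
GENERATORS = {
    "varchar": random_string,
    "nvarchar": random_string,
    "text": random_string,
    "uniqueidentifier": lambda: str(uuid.UUID(int=rng.getrandbits(128), version=4)),
    "int": lambda: rng.randint(-2**31, 2**31 - 1),
    "tinyint": lambda: rng.randint(0, 255),
    "float": lambda: rng.random() * 1000,
    "bit": lambda: rng.randint(0, 1),
    "date": random_date,
}


def generate_value(type_name):
    """Generate a value for a column type, or None if the type is not listed."""
    # Normalize declarations such as "VARCHAR(50)" to "varchar".
    base = type_name.lower().split("(")[0].strip()
    generator = GENERATORS.get(base)
    return generator() if generator else None
```

With this sketch, generate_value("xml") returns None because no generator is registered for that type.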

Data Generation Limitations

  • Data generation does not function properly on schemas with a hyphen (“-”) in the schema name or in any table name.
  • Certain data types will not generate data, including XML and array types.