Building an Automatic Schema Generator for Arbitrary Datasets

As a developer, I've often faced the challenge of working with unfamiliar datasets. Whether it's CSV files from clients or JSON data from APIs, understanding the structure and content of the data is crucial. That's why I decided to build a system that automatically generates a schema for any given dataset. In this post, I'll walk you through the process, sharing the challenges I faced and the solutions I implemented.

The Initial Problem: Detecting Data Types

The first hurdle was accurately detecting the data type of each column. Initially, I started with a simple regex-based approach:

const detectDataType = (value: string): string => {
  if (!value) return "error";

  if (/^[\w.-]+@([\w-]+\.)+[\w-]{2,4}$/.test(value)) return "email";
  if (/^(https?:\/\/)?[\w-]+(\.[\w-]+)+[/\w-]*$/.test(value)) return "url";
  if (/^\d+$/.test(value)) return "integer";
  if (/^\d*\.\d+$/.test(value)) return "float";

  // ... more regex patterns ...

  return "text";
};

However, I quickly found this approach limiting. The regex patterns weren't robust enough to handle all edge cases, and maintaining a complex set of regular expressions became cumbersome.

Another issue I encountered was the use of string literals to represent data types. This approach was brittle and prone to typos. To address this, I created a `DataType` enum:

export enum DataType {
  Error = "error",
  Email = "email",
  URL = "url",
  Phone = "phone",
  Currency = "currency",
  Integer = "integer",
  Float = "float",
  Date = "date",
  Boolean = "boolean",
  Address = "address",
  MultiSelect = "multi_select",
  SingleSelect = "single_select",
  Text = "text",
  Mixed = "mixed",
  Empty = "empty",
}

This enum provided type safety and made it easier to manage the various data types throughout the application.
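To see the payoff, compare what the compiler catches. With the enum, a typo is a build error; with bare string literals, it would have slipped through silently (an illustrative snippet, not code from the project):

let columnType: DataType;
columnType = DataType.Email;    // OK
// columnType = DataType.Emial; // compile error: 'Emial' does not exist on 'typeof DataType'
// With plain strings, columnType = "emial" would have been accepted without complaint.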

Next, I explored using Zod, a TypeScript-first schema validation library. While Zod is powerful for defining and validating schemas, it wasn't purpose-built for detecting data types from raw string values. It required more setup and didn't provide the flexibility I needed for this specific use case.
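For comparison, here's roughly what type sniffing looks like in Zod; this is a sketch I wrote to illustrate the friction, not code from the project. Every candidate type needs its own schema, each check runs through `safeParse`, and you still end up writing the dispatch chain yourself:

import { z } from "zod";

// One schema per candidate type, one safeParse call per check.
if (z.string().email().safeParse(value).success) return DataType.Email;
if (z.string().url().safeParse(value).success) return DataType.URL;
if (z.coerce.number().int().safeParse(value).success) return DataType.Integer;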

Interestingly, it was an AI language model (Claude) that suggested using the `validator` library. This recommendation turned out to be exactly what I needed. The `validator` library is specifically designed for string validation and provides a wide range of built-in validators for common data types.

Here's the improved `detectDataType` function using the `validator` library and the `DataType` enum:

import validator from "validator";

export const detectDataType = (value: string): DataType => {
  if (!value) return DataType.Empty;

  if (validator.isEmail(value)) return DataType.Email;
  if (validator.isURL(value)) return DataType.URL;
  if (validator.isMobilePhone(value, "any")) return DataType.Phone;
  if (validator.isCurrency(value, { allow_negatives: true, require_symbol: true }))
    return DataType.Currency;

  // ... more type checks ...

  return DataType.Text;
};

This approach combines the simplicity of the `validator` library with custom logic for more specific data types. It starts with more specific types (like Email and URL) and moves to more general ones (like Integer and Float). If no specific type is detected, it defaults to Text. The use of the `DataType` enum ensures type safety and makes the code more maintainable.
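A few quick sanity checks show the cascade in action (expected results assume `validator`'s default options plus the currency options shown above):

detectDataType("jane@example.com");         // DataType.Email
detectDataType("https://example.com/docs"); // DataType.URL
detectDataType("$4.99");                    // DataType.Currency (symbol required)
detectDataType("");                         // DataType.Empty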

Sidebar: The Challenge of Detecting Dates

One particularly challenging aspect of data type detection was handling dates. Unlike emails or URLs, which have relatively consistent formats, dates can be represented in numerous ways (e.g., "2023-05-15", "15/05/2023", "May 15, 2023"). Initially, I tried regular expressions, but this approach quickly became unwieldy due to the sheer number of possible date formats.

I then explored libraries like `moment.js` and `date-fns`, which offer robust date parsing capabilities. However, these libraries are quite large, and pulling one in would have significantly increased the bundle size for a single feature.

Ultimately, I settled on a simple yet effective solution using JavaScript's built-in `Date` object:

// Note: numeric checks should run before this one, since purely numeric
// strings like "2023" also parse as valid dates.
const parsedDate = new Date(value);
if (!isNaN(parsedDate.getTime())) return DataType.Date;

This approach leverages the `Date` constructor's ability to parse a wide range of date formats. The constructor always returns a `Date` object, so the real test is whether its timestamp is `NaN`, which is how an invalid date reveals itself. With that one check, we can reliably detect most common date formats without adding any dependencies.

The Solution: Generating the Schema

With the data type detection in place, I created a `generateSchema` function to analyze the entire dataset:

export const generateSchema = (data: string[][]): SchemaResult => {
  const schema: { [key: string]: DataType } = {};
  const typeCounts: { [key: string]: Record<DataType, number> } = {};
  const uniqueValues: { [key: string]: Set<string> } = {};

  // ... implementation details ...

  return { schema, uniqueValues, typeCounts };
};

This function iterates through each column, sampling up to 100 rows to determine the most common data type. It also keeps track of unique values and type counts for each column.
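Since the post elides the body, here's a minimal sketch of how it might look, assuming `data[0]` holds the header row; the shape of `SchemaResult` below is my reconstruction, not the original code:

interface SchemaResult {
  schema: { [key: string]: DataType };
  uniqueValues: { [key: string]: Set<string> };
  typeCounts: { [key: string]: Record<DataType, number> };
}

// Inside generateSchema, between the declarations and the return:
const headers = data[0];
const sampleSize = Math.min(100, data.length - 1);

headers.forEach((header, col) => {
  const counts = {} as Record<DataType, number>;
  const values = new Set<string>();

  // Sample up to 100 data rows (row 0 is the header).
  for (let row = 1; row <= sampleSize; row++) {
    const cell = data[row]?.[col] ?? "";
    const cellType = detectDataType(cell);
    counts[cellType] = (counts[cellType] ?? 0) + 1;
    if (cell) values.add(cell);
  }

  // Pick the most frequent type; the mixed-type rule described later refines this.
  const [topType] = Object.entries(counts).sort((a, b) => b[1] - a[1])[0];
  schema[header] = topType as DataType;
  typeCounts[header] = counts;
  uniqueValues[header] = values;
});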

A New Challenge: Handling Edge Cases

As I tested the system with various datasets, I encountered some edge cases that needed special handling:

  1. Columns with mixed data types
  2. Text columns that were actually categorical (select) fields
  3. Ensuring performance with large datasets

Refining the Solution

To address these challenges, I made several improvements:

  1. For mixed data types, I introduced a `DataType.Mixed` type and set it when no single type represented more than 50% of the sampled values (see the sketch after this list).
  2. To detect categorical fields, I added this logic:
if (
  schema[header] === DataType.Text &&
  uniqueValues[header].size <= Math.min(5, data.length * 0.5)
) {
  schema[header] = DataType.SingleSelect;
}

This checks whether a Text column has a small number of unique values relative to the dataset size, indicating it's likely a categorical field. For example, a "Status" column containing only "active", "inactive", and "pending" across 1,000 rows has 3 unique values, well under the cap of min(5, 500), so it gets reclassified as SingleSelect.

  3. To ensure performance, I limited the sample size to 100 rows:
const sampleSize = Math.min(100, data.length - 1);
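Here's a minimal sketch of how the mixed-type rule from item 1 might look once the per-column type counts are gathered. The 50% threshold comes from the description above; the variable names are my reconstruction, not the original code.

const counts = typeCounts[header];
const total = Object.values(counts).reduce((sum, n) => sum + n, 0);

// Find the most frequent detected type for this column.
const [topType, topCount] = Object.entries(counts).sort(
  (a, b) => b[1] - a[1]
)[0];

// If even the most common type covers half the sample or less, call it Mixed.
schema[header] =
  topCount > total * 0.5 ? (topType as DataType) : DataType.Mixed;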

Visualizing the Results

To make the schema information more user-friendly, I created a React component to display the results:

const SchemaView = ({ data }: { data: string[][] }) => {
  const { schema, uniqueValues, typeCounts } = generateSchema(data);

  return (
    // ... table rendering logic ...
  );
};

This component creates a table showing each column's name, inferred type, number of unique values, and a visual representation of the value distribution.
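Concretely, the elided rendering might look something like this. The markup and the inline-style distribution bars are illustrative; the real component's layout may differ.

return (
  <table>
    <thead>
      <tr>
        <th>Column</th>
        <th>Type</th>
        <th>Unique values</th>
        <th>Distribution</th>
      </tr>
    </thead>
    <tbody>
      {Object.entries(schema).map(([header, type]) => (
        <tr key={header}>
          <td>{header}</td>
          <td>{type}</td>
          <td>{uniqueValues[header].size}</td>
          <td>
            {/* One bar per detected type, width proportional to its count. */}
            {Object.entries(typeCounts[header]).map(([t, count]) => (
              <span
                key={t}
                title={`${t}: ${count}`}
                style={{
                  display: "inline-block",
                  width: count * 2,
                  height: 8,
                  marginRight: 2,
                  background: "#999",
                }}
              />
            ))}
          </td>
        </tr>
      ))}
    </tbody>
  </table>
);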

Conclusion

Building this automatic schema generator has been an enlightening journey. It's taught me the importance of handling edge cases, balancing accuracy with performance, and presenting complex information in a user-friendly way.

While this system works well for many datasets, there's always room for improvement. Future enhancements could include more sophisticated type detection algorithms, support for nested data structures, and the ability to handle very large datasets efficiently.

I hope this walkthrough has given you some insights into tackling similar problems in your own projects. Remember, the key is to start simple, test thoroughly, and iterate based on real-world usage.