Building Clean Datasets
How to build clean datasets from messy, unstructured files
- Create a Collection based on a Postgres source.
- Turn on a Collection-specific ingest email address.
- Incoming files are stored in S3.
- New files in S3 are parsed by an AI step, and the results are written to Postgres.
- Manual review is triggered on all new rows.
What we are building
Imagine you run a marketplace or listings business that relies on third-party, user-generated content to fill its catalog. For example, you might be a real estate marketplace processing new listings from sellers, or a travel marketplace processing new listings from hotels.
People email, WhatsApp, and AirDrop new information to you all the time, and the files you get are messy, unstructured, and not fit for your purposes. Someone has to go through each file manually, extract the data, and enter it into your database. Your operations team is overworked, and the process does not scale. In addition:
- You already have a database of listings (in Postgres), and you want to keep it updated with the latest information
- You have a place to store the files (for example, in S3)
Let's see how to use Datograde to bring these files into one place, process them with AI (finding the right prompts along the way), and build a high-quality, scalable catalog that powers your business.
Solution Overview
When a supplier sends you a new email with an attached file:
- You open a Datograde Form to paste the email and upload the file
- The file will be uploaded to S3 and recorded in the database, alongside the email contents
- Datograde will parse the file with AI, using your configured prompts, and record the results in the database
- You will get an email with a link to review the new data
- You will approve the new data, and it will be added to your catalog
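The flow above can be sketched in a few lines of Python. This is an illustration of the sequence of states only, not Datograde's implementation: `Submission`, `Pipeline`, and `fake_parse` are hypothetical names, the dictionaries and lists stand in for S3 and Postgres, and the toy parser stands in for the AI step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Submission:
    """One form submission: the pasted email plus the uploaded file."""
    email_body: str
    file_name: str
    file_bytes: bytes
    file_key: Optional[str] = None           # where the file was stored
    parsed: Optional[Dict[str, str]] = None  # output of the parsing step
    status: str = "received"                 # received -> pending_review -> approved


def fake_parse(email_body: str, file_bytes: bytes) -> Dict[str, str]:
    """Toy stand-in for the AI step: reads 'key: value' lines from the file."""
    parsed = {}
    for line in file_bytes.decode("utf-8").splitlines():
        key, sep, value = line.partition(":")
        if sep:
            parsed[key.strip().lower()] = value.strip()
    return parsed


@dataclass
class Pipeline:
    parse: Callable[[str, bytes], Dict[str, str]]
    file_store: Dict[str, bytes] = field(default_factory=dict)   # stand-in for S3
    rows: List[Submission] = field(default_factory=list)         # stand-in for Postgres
    catalog: List[Dict[str, str]] = field(default_factory=list)  # approved entries only

    def submit(self, email_body: str, file_name: str, file_bytes: bytes) -> Submission:
        sub = Submission(email_body, file_name, file_bytes)
        # Store the file, then record the row alongside the email contents.
        sub.file_key = f"uploads/{len(self.rows)}/{file_name}"
        self.file_store[sub.file_key] = file_bytes
        # Parse the file and hold the result for manual review.
        sub.parsed = self.parse(email_body, file_bytes)
        sub.status = "pending_review"
        self.rows.append(sub)
        return sub

    def approve(self, sub: Submission) -> None:
        if sub.status != "pending_review":
            raise ValueError(f"cannot approve a submission in state {sub.status!r}")
        sub.status = "approved"
        self.catalog.append(sub.parsed)


pipeline = Pipeline(parse=fake_parse)
sub = pipeline.submit("New listing attached", "listing.txt",
                      b"Title: Sea view flat\nPrice: 250000")
pipeline.approve(sub)
print(pipeline.catalog)
```

The point of the sketch is that nothing reaches the catalog until a reviewer approves it; parsed rows sit in a pending state in between.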
1. Connecting Datograde to your databases
Postgres
- Navigate to the Integrations page in your account settings
- Click on PostgreSQL to open the configuration form
- Fill in the required connection details
See PostgreSQL for more details.
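The exact fields of the configuration form are not reproduced in this guide. As a rough orientation, a Postgres connection usually needs a host, port, database name, user, and password. The sketch below assumes that usual set and assembles it into a standard `postgresql://` connection URI, which can be handy for checking credentials with another client before entering them. The function name and the example values are hypothetical.

```python
from urllib.parse import quote


def build_postgres_uri(host: str, port: int, database: str,
                       user: str, password: str, sslmode: str = "require") -> str:
    """Build a postgresql:// URI, percent-encoding user, password, and database."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}?sslmode={sslmode}"
    )


# Example values only; replace with your own connection details.
uri = build_postgres_uri("db.example.com", 5432, "listings",
                         "datograde_reader", "p@ss/word")
print(uri)
```

Percent-encoding matters when the password contains characters such as `@` or `/`, which would otherwise be read as part of the URI structure.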
S3
- Navigate to the S3 Integration page in your account settings
- Enter your S3 credentials
- Click Save Configuration to store your credentials
Once configured, you'll see a green checkmark indicating that S3 is properly connected.
See S3 for more details.
2. Bringing files into Datograde
Creating a new Collection based on a Postgres source
- Navigate to the Collections page
- Click on + Collection
- Select Postgres as the source
- Fill in the required fields
Creating a Form to capture data from new emails and messages
- In the collection you just created, click on the Forms tab
- Click on Create Form
- Enter a Title and a Description for the form
- Select the fields that you want people to fill out
- Click Create
3. Parsing files with AI
- Click on the Fields tab
- Create a new AI field by clicking on Add Field
- In the new field, select AI generate as the field type
- Configure the Prompt and input mappings
- Check Run on new rows
- Click Save
At this point, you should have these fields in your Collection:
- A field for the email body
- A field for the file
- A field for the parsed data (AI generated)
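To make the Prompt and input mappings concrete, here is a minimal sketch of what a prompt with two mapped inputs might look like, plus a check on the model's output. It is not Datograde's prompt syntax: the `$placeholder` style, the field names `email_body` and `file_text`, and the `title`, `price`, `location` schema are all assumptions chosen for a listings example.

```python
import json
from string import Template

# Hypothetical prompt; $email_body and $file_text are the mapped inputs.
PROMPT = Template(
    "Extract the listing from the supplier's message and attached file.\n"
    "Return a JSON object with exactly these keys: title, price, location.\n\n"
    "Email:\n$email_body\n\nFile contents:\n$file_text\n"
)

# Input mappings: prompt placeholder -> field on the Collection row (assumed names).
INPUT_MAPPINGS = {"email_body": "email_body", "file_text": "file_text"}

REQUIRED_KEYS = {"title", "price", "location"}


def render_prompt(row: dict) -> str:
    """Fill each prompt placeholder from the row field it is mapped to."""
    values = {placeholder: row[field_name]
              for placeholder, field_name in INPUT_MAPPINGS.items()}
    return PROMPT.substitute(values)


def validate_parsed(raw: str) -> dict:
    """Parse the model's reply and reject it if it is not the expected object."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


row = {"email_body": "Hi, new flat attached.", "file_text": "3 bed flat, Lisbon, 250000 EUR"}
print(render_prompt(row))
```

Asking for a fixed set of keys in the prompt and checking for them afterwards makes it easier to spot prompts that need more work before rows reach manual review.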
4. Reviewing parsed data
Now, whenever a new email with a file is submitted, Datograde will parse the file with AI, and record the results in the database.
You can see the submissions in the Forms tab.
When you approve a submission, the parsed data will be added to the Collection.
5. Building a catalog
With only approved, clean entries in your Collection, you can ship this content to your product. Go to the Ship tab to find the REST API details and pull this data into your product.
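As a rough sketch of the consuming side, the snippet below builds an authenticated GET request and reads a JSON response using only the Python standard library. The endpoint URL, the bearer-token header, and the JSON response format are assumptions for illustration; use the actual URL and authentication details shown in the Ship tab.

```python
import json
import urllib.request

# Placeholders: copy the real endpoint and key from the Ship tab.
API_URL = "https://api.example.com/collections/COLLECTION_ID/rows"
API_KEY = "YOUR_API_KEY"


def build_request(url: str = API_URL, api_key: str = API_KEY) -> urllib.request.Request:
    """Prepare a GET request for the Collection's rows (nothing is sent yet)."""
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", f"Bearer {api_key}")  # assumed auth scheme
    req.add_header("Accept", "application/json")
    return req


def fetch_rows(req: urllib.request.Request):
    """Send the request and decode the JSON body (assumed response format)."""
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


request = build_request()
print(request.get_method(), request.full_url)
```

Calling `fetch_rows(build_request())` with real values would return the decoded response, which your product can then cache or render.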