AI-augmented data preparation in BigQuery, powered by Gemini, offers intelligent suggestions for cleaning, transforming, and enriching data, significantly reducing manual effort. Dataform orchestrates these preparations, supporting CI/CD processes for collaboration.
Benefits
- Time Reduction: Context-aware, Gemini-generated transformation suggestions.
- Data Quality: Automated schema mapping and data quality cleanup.
- Collaboration: CI/CD support for code reviews and source control.
Users and Dataform service accounts need specific IAM roles. Data preparations are managed in BigQuery Studio. Opening a table triggers a BigQuery job that samples data for Gemini to generate suggestions.
Views in the Data Preparation Editor
- Data View: Displays a sample of the table and allows interaction and application of Gemini suggestions.
- Graph View: Visual overview of the data preparation pipeline.
- Schema View: Displays and allows operations on the current schema.
Gemini offers context-aware suggestions for transformations, data quality rules, standardization, enrichment, and schema mapping. Each suggestion includes a high-level category, description, and corresponding SQL expression.
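As a rough illustration of the three parts each suggestion carries, the sketch below models a suggestion as a small Python record and groups a few by category. The `Suggestion` class, its field names, and the sample values are hypothetical and exist only for this example; they are not a BigQuery or Gemini type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """Hypothetical record for one suggestion; not a BigQuery type."""
    category: str        # high-level category, e.g. "Transformation"
    description: str     # human-readable explanation of the change
    sql_expression: str  # SQL expression that would be applied


def group_by_category(suggestions):
    """Group suggestions by category, preserving first-seen order."""
    grouped = {}
    for s in suggestions:
        grouped.setdefault(s.category, []).append(s)
    return grouped


# Made-up sample suggestions for illustration only.
examples = [
    Suggestion("Transformation", "Trim whitespace from name", "TRIM(name)"),
    Suggestion("Data quality", "Require a non-negative amount", "amount >= 0"),
    Suggestion("Transformation", "Upper-case country code", "UPPER(country)"),
]
by_category = group_by_category(examples)
```

Keeping the SQL expression as a separate field mirrors the point above: the category and description help you decide whether to accept a suggestion, while the expression is what actually changes the data.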
BigQuery previews a data preparation on a sample of the data, and samples are not refreshed automatically. To reduce cost and processing time, change the write mode settings so that runs process new data incrementally. Supported write modes are Full refresh, Append, and Incremental.
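The difference between the three write modes can be sketched in plain Python. This is a simplified model, not BigQuery's implementation: the `write` function, the mode strings, and the `updated_at` watermark column are assumptions made for the example, and the real Incremental mode may track new or changed records differently.

```python
def write(destination, new_rows, mode, key="updated_at"):
    """Return the destination contents after one run in the given mode.

    Rows are dicts. `key` is a hypothetical, ever-increasing column used
    as the incremental watermark.
    """
    if mode == "full_refresh":
        # Replace everything with the freshly prepared rows.
        return list(new_rows)
    if mode == "append":
        # Add every prepared row, even ones that were written before.
        return list(destination) + list(new_rows)
    if mode == "incremental":
        # Keep only rows newer than the highest watermark already written.
        watermark = max((r[key] for r in destination), default=None)
        fresh = [r for r in new_rows if watermark is None or r[key] > watermark]
        return list(destination) + fresh
    raise ValueError(f"unknown write mode: {mode}")


existing = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
incoming = [{"id": 2, "updated_at": 20}, {"id": 3, "updated_at": 30}]

full = write(existing, incoming, "full_refresh")
appended = write(existing, incoming, "append")
incremental = write(existing, incoming, "incremental")
```

In this model, a full refresh leaves only the incoming rows, append duplicates the row with `id` 2, and incremental adds only the row with `id` 3. That is why incremental processing can save cost when most of the source is unchanged between runs.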
Supported Data Preparation Steps
- Source: Added when you select a source table to read from or add a join step.
- Transformation: Cleans and transforms data using SQL expressions.
- Filter: Removes rows using WHERE clause syntax.
- Validation: Sends rows meeting validation criteria to an error table.
- Join: Joins values from two sources with various join operations.
- Destination: Defines the table to which the output of the data preparation steps is written.
- Delete Columns: Removes columns from the schema.
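To make the step types above concrete, here is a minimal sketch that mimics each one with ordinary SQL, run against an in-memory SQLite database from Python's standard library. It is an analogy only: SQLite's dialect differs from BigQuery's, and the table names, column names, and rules are invented for the example. It also assumes that the filter keeps rows matching its WHERE condition and that the validation criteria describe the rows to divert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source: two hypothetical input tables.
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                         amount REAL, note TEXT);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10, 25.0, 'ok'), (2, 11, -5.0, 'refund?'),
                              (3, 10, 0.0, 'empty'), (4, 12, 40.0, 'ok');
    INSERT INTO customers VALUES (10, '  ada '), (11, 'bob'), (12, ' cy');

    -- Validation: rows matching the criteria (negative amount) go to an
    -- error table instead of the destination.
    CREATE TABLE order_errors AS SELECT * FROM orders WHERE amount < 0;

    -- Destination: the table that receives the prepared rows.
    CREATE TABLE prepared_orders AS
    SELECT o.order_id AS order_id,
           TRIM(c.name) AS customer_name,     -- Transformation
           o.amount AS amount                 -- Delete Columns: note and
                                              -- customer_id are left out
    FROM orders AS o
    JOIN customers AS c USING (customer_id)   -- Join
    WHERE o.amount >= 0                       -- rows not diverted by validation
      AND o.amount <> 0;                      -- Filter
""")

prepared = conn.execute(
    "SELECT order_id, customer_name, amount FROM prepared_orders ORDER BY order_id"
).fetchall()
errors = conn.execute("SELECT order_id FROM order_errors").fetchall()
columns = [d[0] for d in conn.execute("SELECT * FROM prepared_orders").description]
```

In the data preparation editor each of these clauses would be a separate step in the pipeline; they are combined into one statement here only to keep the sketch short.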
Schedule one-time or recurring data preparation runs from the data preparation editor or manage them from the BigQuery Orchestration page. BigQuery data preparation does not have its own API. Contact bq-datapreparation-feedback@google.com for more information.
Limitations
- Source and destination datasets must be in the same location.
- While you edit a pipeline, data and interactions are processed in a US data center.
- Natural language SQL query generation is not supported, nor is viewing, comparing, or restoring data preparation versions.
- Gemini responses are based on a sample of the dataset.
For more detailed steps and configurations, refer to the BigQuery documentation.