Calculating a proxy bit size in a BigQuery table

Asked at 2017-01-10 17:37:35Z
  • 5 Subscribers
  • 114 Views
0

How does one go about calculating the bit size of each record in BigQuery sharded tables across a range of time?

Objective: measure how much the data has grown over time

Nuances: Of the 70-some fields, some records would have nulls for most of them, some would have long string text grabbed directly from the raw logs, and others could be float/integer/date types.

I'm wondering if there's an easy way to do a proxy count of the bit size for one day, which I could then expand to a range of time.


1 answer in total

2
Mikhail Berlyant Posted at 2017-01-11 01:43:46Z

Example from my experience:
One of my tables is a daily sharded table with a daily size of 4-5 TB. The schema has around 780 fields. I wanted to understand the cost of each data point (bit size) [it was then used for calculating ROI based on cost/usage].

So, let me give you an idea of how the cost (bit-size) side of it was approached.

The main piece here is the use of the dryRun property of the Jobs: Query API.

Setting dryRun to true makes BigQuery return statistics about the job (such as how many bytes would be processed) instead of actually running it. And that's exactly what is needed here!

So, for example, the request below is designed to get the cost of trafficSource.referralPath in the ga_sessions table for 2017-01-05:

POST https://www.googleapis.com/bigquery/v2/projects/yourBillingProject/queries?key={YOUR_API_KEY}
{
 "query": "SELECT trafficSource.referralPath FROM yourProject.yourDataset.ga_sessions_20170105`",
 "dryRun": true,
 "useLegacySql": false
}

You can get this value by parsing totalBytesProcessed out of the response. See an example of such a response below:

{
  "kind": "bigquery#queryResponse",
  "jobReference": {
    "projectId": "yourBillingProject"
  },
  "totalBytesProcessed": "371385",
  "jobComplete": true,
  "cacheHit": false
}
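
If you end up doing this from a client library rather than raw REST, below is a minimal sketch of the same dry-run call with the Go client (cloud.google.com/go/bigquery); the helper name dryRunBytes and the project/dataset/table names are placeholders for illustration, not anything from the actual script:

package main

import (
    "context"
    "fmt"
    "log"

    "cloud.google.com/go/bigquery"
)

// dryRunBytes runs the given query in dry-run mode and returns the number of
// bytes BigQuery reports it would process, without actually executing the query.
func dryRunBytes(ctx context.Context, client *bigquery.Client, sql string) (int64, error) {
    q := client.Query(sql)
    q.DryRun = true
    job, err := q.Run(ctx)
    if err != nil {
        return 0, err
    }
    // A dry-run job completes immediately; its statistics carry the byte count.
    return job.LastStatus().Statistics.TotalBytesProcessed, nil
}

func main() {
    ctx := context.Background()
    client, err := bigquery.NewClient(ctx, "yourBillingProject") // placeholder billing project
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    b, err := dryRunBytes(ctx, client,
        "SELECT trafficSource.referralPath FROM `yourProject.yourDataset.ga_sessions_20170105`")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("trafficSource.referralPath: %d bytes\n", b)
}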

So, you can write a relatively simple script in the client of your choice that:

  1. reads the schema of your table – you can use the Tables: get API for this, or, if the schema is known and readily available, you can simply hardcode it
  2. loops through each and every field in the schema
  3. inside the loop – calls the Query API with dryRun and extracts the size of the respective field (as outlined above), and of course logs it (or just collects it in memory) – see the sketch after this list
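
A rough sketch of that loop in Go, building on the dryRunBytes helper from the sketch above (same package). It only walks top-level fields (nested RECORD fields such as trafficSource would need recursing into field.Schema) and assumes the dataset lives in the client's default project:

// listFieldSizes reads the table schema (the Tables: get call) and dry-runs a
// single-column SELECT for every top-level field, returning field name -> bytes.
func listFieldSizes(ctx context.Context, client *bigquery.Client, dataset, table string) (map[string]int64, error) {
    md, err := client.Dataset(dataset).Table(table).Metadata(ctx)
    if err != nil {
        return nil, err
    }
    sizes := make(map[string]int64)
    for _, field := range md.Schema {
        sql := fmt.Sprintf("SELECT %s FROM `%s.%s`", field.Name, dataset, table)
        b, err := dryRunBytes(ctx, client, sql)
        if err != nil {
            return nil, err
        }
        sizes[field.Name] = b
    }
    return sizes, nil
}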

As a result of the above, you will have a list of all fields with their respective sizes.
If you then need to analyze how those sizes change over time, you can wrap the above in yet another loop where you iterate through as many days as you need and collect stats for each and every day.

If you are not interested in day-by-day analysis, you can just make sure your query actually covers the range you are interested in. This can be done with the use of a Wildcard Table.
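
For illustration only, such a range query could be built like this (the month and the names are placeholders) and fed to the same dryRunBytes helper, so a single dry run reports the bytes for the whole range:

// rangeSQL builds one query that covers a span of daily shards via a wildcard
// table, instead of querying each day separately.
func rangeSQL(field string) string {
    return fmt.Sprintf(
        "SELECT %s FROM `yourProject.yourDataset.ga_sessions_*` "+
            "WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131'", field)
}

For example, dryRunBytes(ctx, client, rangeSQL("trafficSource.referralPath")) would return the bytes for that field across all of January 2017.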

I consider this a relatively easy way to go.

Personally, I remember doing this with Go, but it doesn't matter – you can use any client you are most comfortable with.

Hope this will help you!
