Creative use of date partitions

Asked at 2017-01-10 15:56:39Z
  • 5 Subscribers
  • 114 Views
1

I have some data that I would like to partition by date, and also partition by an internally-defined client id.

Currently, we store this data uses the table-per-date model. It works well, but querying individual client ids is slow and expensive.

We have considered creating a table per client id, and using date partitioning within those tables. The only issue here is that would force us to incur thousands of load jobs per day, and also have the data partitioned by client id in advance.

Here is a potential solution I came up with: -Stick with the table-per-date approach (eg log_20170110) -Create a dummy date column which we use as the partition date, and set that date to -01-01 (eg for client id 1235, set _PARTITIONTIME to 1235-01-01)

This would allow us to load data per-day, as we do now, would give us partitioning by date, and would leverage the date partitioning functionality to partition by client id. Can you see anything wrong with this approach? Will BigQuery allow us to store data for the year 200, or the year 5000?

PS: We could also use a scheme that pushes the dates to post-zero-unixtime, eg add 2000 to the year, or push the last two digits to the month and day, eg 1235 => 2012-03-05.


2 answers in total

0
Mikhail Berlyant Posted at 2017-01-11 03:02:33Z

Will BigQuery allow us to store data for the year 200, or the year 5000?

Yes, any date between 00001-01-01 and 9999-12-31

So formally speaking this is an option (and btw depends on how many clients you plan / already have)

See more about same idea at http://stackoverflow.com/a/41091896/5221944

Meantime, I would expect BigQuery to have soon ability to partition by arbitrary field. Maybe at NEXT 2017 - just guessing :o)

0
Pavan Edara Posted at 2017-01-11 07:46:23Z

The suggested idea is likely to create some performance issues for queries (as the number of partitions increase). Generally speaking, Date partitioning works well with a few 1000 partitions.

client_ids are generally unrelated with each other and are ideal for hashing. While we work towards supporting richer partitioning flavors, one option is to hash your client_ids into N buckets (~100?), and have N partitioned tables. That way you can query across your N tables for a given date. Using, for example, 100 tables would drop the cost down to 1% of what it would be using 1 table with all the client_ids. It should also scan a small number of partitions, improving performance also accordingly. Unfortunately, this approach doesn't address the concern of putting the client ids in the right table (it has to be managed by you).

Answer this questsion