![]() |
|
Published 2008-07-22 Printer-friendly version
In his Pro PostgreSQL tutorial session Robert Treat covered a variety of topics
including understanding the nature of the PostgreSQL community, performance
optimizations, upgrade paths, availability and hardware. Treat is a database
architect at OmniTI and co-author of the book
Beginning
PHP and PostgreSQL 8. He also maintains the phpPgAdmin
software package and has been recognized as a major contributor to the PostgreSQL
project for his work over the years.
Robert began with an overview of the nature of the PostgreSQL open source community. Unlike commercial databases there is no one corporate backer deciding the future direction of PostgreSQL. Rather, changes to the product happen via an organic process involving a variety of groups.
The focus of the development community is the pgsql-hackers mailing list. Anyone can get involved at this level, but most of these folks don't actually have the rights to commit changes to the source code. Instead they write and test their own code and submit patches. These patches can only be applied to the official source code by committers, a much smaller group of hackers numbering a dozen or so. There is no formal process to become a committer - just contribute a lot of good patches and in time you may be invited to become a committer.
At the top of the community hierarchy is the core team which has seven members. Not all of these people are committers. For instance Mark Fournier focuses on release management, and Josh Berkus (lately of Sun) is the media guy.
A big part of getting along with PostgreSQL is knowing what resources are available; to some extent you have to figure out for yourself what you need. Your starting point should be www.postgresql.org (this site is maintained by the PostgreSQL web team).
Other useful sites include:
You should be using a package manager/installer appropriate to your platform. If you're installing on Windows things are pretty simple - just use the Windows installer. Installing on *nix is potentially more complicated: if you're dealing with multiple platforms superuser accounts and command line utilities can vary. Package management only makes things consistent per platform, not across platforms.
The core package includes the database server as well as commonly-used utilities. But there's a whole lot of contrib (user contributed) code as well. On Windows this is part of the install; on other platforms it's usually a separate install. One caveat on contrib modules: upgrades can be a hassle of the contrib code depends on anything that was changed in the core PostgreSQL code. In general the contrib code is very solid. "Trust these more than your own code."
One of the more popular contrib modules is pgcrypto, which is not included in the core due to export restrictions. For help with tuning look at pgstatstuple, pg_buffercache, and pg_freespacemap.
There are various procedural languages available (four ship with the core code but there are about a dozen). Some are not included because they're released under different licenses, or because of dependency management issues, or because they're still pretty new. PL/Perl is probably the best of the bunch.
Step one after installation is to figure out logging. You'll definitely want
this so you can work out exactly what errors you are getting, since the interface
you're using to the database probably won't give you detailed information. Usually
the logs are in pg_log under the data directory ($PGDATA) - check
postgresql.conf if in doubt of the location. On *nix the log files may be in
/var/log/. If you're able to log in to the console then show data_directory
will reveal the file locations.
For performance reasons it's a good idea to keep the log files and the data files on separate drives.
Authentication is the next hurdle. If you don't see an error in the log (look
for FATAL) then check your firewall because you're probably not hitting
the database at all. Check your pg_hba.conf configuration.
TRUST authentication is okay if you have a secure network or you're
just doing development, but md5 is a better way to go as it offers
decent security. Treat didn't have much good to say about IDENT which
requires that your user name is the same as the PostgreSQL database user name.
This is fine if it's what you want, but it can cause a lot of headaches if you're
not aware of how IDENT works.
There are other ways to handle authentication but these are the big three.
Treat was candid about the problems with PostgreSQL upgrades. There is no in-place upgrade tool included with the core distribution so you will typically have to handle major version upgrades yourself. Normally "major" means any upgrade in which either the first or the second version number has changed; usually third version number changes are bug fixes/features only and do not involve database changes, just new binaries. All major upgrades involve pain, and skipping releases usually makes things worse. Always, always make a backup first and be sure to read the release notes.
There is a tool called pg_migrator which attempts to handle in-place upgrades. Right now it only works for 8.1 -> 8.2, but there is no 8.2 -> 8.3 path at present (and maybe never - the next path may be 8.3->8.4). pg_migrator works by rewriting the system catalogs, and there's no way to go back to the previous version. It's pretty new.
The simplest way to upgrade is to use pg_dump from the old database and pg_restore to the new database. Treat recommends the -Fc option for compressed files (on the order of a tenth of the size of the uncompressed file). If you have a large database this is going to be a long, long process, on the order of hours. And you pretty much need to have both versions of PostgreSQL installed at the same time. Use the pg_dump from the newer version, not the older version.
Note that pg_dump 8.2 has problems with versions prior to 7.3. If you're on a prior version then you need to upgrade to 7.3, run adddepends to fix the system catalog info, then upgrade to 8.3.
There are several ways you can avoid lengthy downtime. One is to replicate your database (using slony, postgres-r etc.) to the newer version. When you're sure everything's working properly just cut over to the new server. With slony, which is a master-slave system, you'll need to set the new server to be the master, but that's quickly done and downtime may be under a minute. But be aware of network load issues while you're doing the initial recreation of the slave (newer format) database.
Keep in mind that PostgreSQL ships with very conservative settings to enable it to run on lowly hardware with minimal available memory. That almost certainly means it won't run optimally on your system.
There are around 200 config options, but only about 20 will really matter for most DBAs. The top five are:
effective_cache_size - You probably want to use whatever ram
is available after your other apps are up and running. This memory will only
be used if required. 75% on a dedicated machine is probably about right.shared_buffers - This does preallocate memory so don't set
it higher than it needs to be. Start at 20% of your ram and tweak downwards.
It all depends on the nature of your database traffic. This requires a database
server restart so get it right before you into production.default_statistics_target - The default (10) is probably too
low. Set it to 100 as a starting point. The higher you set it the more data
sampling it will do. Set it really high (say 400) is probably too high for
most situations.work_mem - This is memory for sorting/hashing inside an SQL
query. You may want to set this higher, but this is the max amount allocated
for certain tasks but on a per-thread basis, so be careful if you have a lot
of concurrent processes.checkpoint_segments - These segments are chunks of data being
written to disk. If you have a lot of write activity you want to increase
this number. A value of 30 isn't unheard of. If fast recovery time after a
crash matters then you want to keep this number lower since if you have more
of these files the database will take longer to process them on restarting.PostgreSQL's vacuum command cleans up the database by reclaiming unused space, updating table stats, and preventing transaction ID wraparound. In previous versions you had to run vacuum explicitly, but PostgreSQL now has an autovacuum mode which you can successfully use with the default settings. There may be times when you need to run vacuum manually, usually only on tables that get excessive write traffic. Vacuum has a lazy and a full mode - full locks the table, and you should never have to run it regularly.
You may find situations where you need to reindex tables. This area has improved with recent releases but keep it in mind if you run into index performance problems.
Don't go cheap on the hardware. Think lots of ECC RAM, redundant disks, RAID (hardware of software, but software RAID can be easier to fix if you run into problems. For hard drives SCSI is best but SATA is often acceptable. If you're on Solaris consider RAID Z. Good hardware solves a lot of problems.
PostgreSQL makes good use of multiple CPUs. You get great payback up to 8 CPUs, or even 16, but by 32 you've probably maxed your bang for buck.
On 8.3 one processor isn't better than another per se. See tweakers.net
for benchmarks - they do a lot of testing with PostgreSQL as well as MySQL.
The site is mostly in Dutch but there's still lots of good info to be had. (My
tip: Google with the search term postgresql site:tweakers.net).
When using multiple disk configurations the usual rules apply - don't mix access types if you can help it. For instance, put write-ahead logs on their own disk (RAID 1 is fine for that) since these are all sequential writes. Put data directory on its own disk. Treat recommended RAID 10 over RAID 5 which he said was too slow (although one attendee suggested that after eight spindles you won't see a difference).
The more spindles the better. More controllers is even better if you want to use tablespaces. Separate tables into tablespaces by the kind of access they need - read vs write.
Postgresql scales up well - just throw more hardware at it. Use connection pooling if necessary but a single machine, properly equipped, can handle 1000+ connections with terabytes of data and 10,000 transactions per second.
Watch out for write-caching controllers - if you need this be sure to use battery backed controllers to avoid situations where the server crashes and you lose transactions.
On *nix systems do not use the NFS operating system; it is not fully safe.
Treat went over a number of options for ensuring data remains available and recoverable.
On *nix you can do file system snapshots with zfs, lvm. Wrap the snapshot with pg_start_backup and pg_stop_backup for safety.
You can also exploit PostgreSQL's point-in-time-recovery (PITR) feature which employs write-ahead log (WAL) files. If you take a snapshot of the database all you have to do to make it current is to replay any WAL files created since the last snapshot, and you don't have to be exact about when the logs start as long as they include the missing data.
If you're using replication with PostgreSQL then you're probably talking master-slave replication, and there's always the possibility that if the master goes down there will be transactions that never made it out to the slave. But you can do a very quick switchover. Another replication option is bucardo, which was open sourced a while ago. It's multi-master that also does master-slave, but it isn't yet available on Windows. Newer yet is postgres-r which is multi-master synchronous and open sourced just this month. It looks promising, but these are still early days.
There are a variety of other techniques for ensuring availability which reply
on either duplicated data or just duplicated server processes. Shared disks
let you run multiple servers on one data set, but you still have a single point
of failure in the data store. You could try pgpool which is a statement-based
dual master system; all your inserts and updates go to two different machines,.
There's no synchronization; rather you're relying on the SQL being executed
identically on both machines. Be aware that random(), now(),
and sequences could be different so even if all the statements execute the data
stores might not be identical. More problematic is you have to run it as a superuser
- you can't do normal authentication. And pgpool itself becomes a single point
of failure.
You can use replication to scale up your database, although in most cases this works best for scaling read operations since writes would have to be replicated on all machines. You can also use pgpool-II (not really related to pgpool) to put a single database across multiple machines. This product comes from the Japanese PostgreSQL community so the doc can be difficult to access. pgcluster is similar to pgpool-II and is a synchronous multi-master but scales read operations only.
The core team has announced that at some point they will officially support PITR read-only slaves, and there's a push on for warm standby.
Treat also mentioned some performance improvement strategies including delving
into the pg_stat* system tables and exploiting the EXPLAIN statement.
Set the log_min_duration_statement to log slow queries. Keep in
mind that PostgreSQL supports partial indexing with WHERE clauses and
functional/expressional indexes. Full text search is much faster than LIKE
expressions and has support for multiple languages, custom dictionaries, special
indexing options. Full text search is integrated into 8.3.
Treat's presentation was professional, thorough and clear. Although he's clearly a big fan of PostgreSQL he didn't shy away from the problems inherent in a community-based open source project, and he was blunt about PostgreSQL's weaknesses particularly when it comes up database upgrades. I'd like to see in-place upgrades (don't hold your breath) and a solid master-master replication option (perhaps postgres-r will fill that niche), but in its current state PostgreSQL is clearly capable of some pretty heavy lifting.
Copyright © 1999-2008 by CoveComm Inc. All Rights Reserved. Reproduction in any form without the express written consent of CoveComm Inc., except as described in the subscription agreement, is prohibited.
Clarion Magazine ISSN 1718-9942
One year: $159
(reg $189, save $30)
(includes all back issues since '99)
Renewals from $109
Two years: $249
(reg $289, save $40)
Renewals from $199