Discussion:
Are 50 million rows a problem for postgres ?
Vasilis Ventirozos
2003-09-08 07:32:51 UTC
Hi all, I work in a telco and I have a huge amount of data (50 million rows),
but I see a lack of performance on huge tables with Postgres.
Are 50 million rows the "limit" of Postgres (with good performance)?
I am expecting 2 billion records in 2004, so I have to do something.
Does anyone have a huge database that I could ask about some issues?

My hardware is good and my indexes are good, so please don't just answer
"use vacuum" :)



The Joy of Index

Vasilis Ventirozos
--------------------


Hornyak Laszlo
2003-09-08 08:32:37 UTC
Could you give more detailed information?
What does explain say?
Post by Vasilis Ventirozos
Hi all, I work in a telco and I have a huge amount of data (50 million rows),
but I see a lack of performance on huge tables with Postgres.
Are 50 million rows the "limit" of Postgres (with good performance)?
I am expecting 2 billion records in 2004, so I have to do something.
Does anyone have a huge database that I could ask about some issues?
My hardware is good and my indexes are good, so please don't just answer
"use vacuum" :)
Vasilis Ventirozos
2003-09-08 08:56:15 UTC
It's not a standard statement; I am trying to get statistics for the company, and
I see a lack of performance (the same statement on Informix runs well, with
the same indexes of course).


Vasilis Ventirozos

Sam Barnett-Cormack
2003-09-08 09:16:21 UTC
Post by Vasilis Ventirozos
Hi all, I work in a telco and I have a huge amount of data (50 million rows),
but I see a lack of performance on huge tables with Postgres.
Are 50 million rows the "limit" of Postgres (with good performance)?
I am expecting 2 billion records in 2004, so I have to do something.
Does anyone have a huge database that I could ask about some issues?
My hardware is good and my indexes are good, so please don't just answer
"use vacuum" :)
I have a similarly huge number of records, as I process our web, ftp,
and rsync logs together using postgres. Works like a charm. You do have
to allow for queries taking a long time. I use about 6
queries to summarise a quarter's data - each is run for each month, so a
total of 18 queries. These run in a little over 24 hours. And there are
many, many records per month.
--
Sam Barnett-Cormack
Software Developer | Student of Physics & Maths
UK Mirror Service (http://www.mirror.ac.uk) | Lancaster University

Donald Fraser
2003-09-08 10:03:12 UTC
Post by Vasilis Ventirozos
Hi all, I work in a telco and I have a huge amount of data (50 million rows),
but I see a lack of performance on huge tables with Postgres.
Are 50 million rows the "limit" of Postgres (with good performance)?
I am expecting 2 billion records in 2004, so I have to do something.
Does anyone have a huge database that I could ask about some issues?
My hardware is good and my indexes are good, so please don't just answer
"use vacuum" :)
I did some performance testing back on PostgreSQL version 7.2 on a table of
350,000 records.
My analysis at the time was that for random record access, performance
deteriorated the further the records being accessed were from the
beginning of the index. For example, a query with, say, OFFSET 250000
would cause large delays; on a text index these delays were on the order of 60
seconds, which was unacceptable for the application I was developing.
To overcome this problem I did away with queries that used OFFSET and
developed queries that accessed the data relative to the previous records I had
fetched. The only drawback was that I had to make the indexes unique by
appending a unique id column as the last column of the index. I now have no
problem scanning a table in excess of 1 million records in small chunks at a
time.
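
As a rough sketch of that relative-access approach (the table and column
names here are invented for illustration; the idea is simply to remember the
last key returned and continue from it instead of using OFFSET):

    -- unique ordering key: (sort_col, id), with id appended to break ties
    -- CREATE UNIQUE INDEX big_table_sort_idx ON big_table (sort_col, id);

    -- instead of: SELECT * FROM big_table ORDER BY sort_col OFFSET 250000 LIMIT 100;
    -- remember the last (sort_col, id) of the previous chunk and fetch the rows after it:
    SELECT *
    FROM big_table
    WHERE sort_col > 'last_seen_value'
       OR (sort_col = 'last_seen_value' AND id > 12345)
    ORDER BY sort_col, id
    LIMIT 100;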

Anyway, without seeing what sort of query you are having problems with, nobody on
these mailing lists will be able to fully help you.
At a minimum we need to see the SQL statement and the results of EXPLAIN.

Regards
Donald Fraser.



Vasilis Ventirozos
2003-09-08 10:26:05 UTC
This is a simple statement that I run:

core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode;
QUERY PLAN
---------------------------------------------------------------------------------------
 Aggregate  (cost=2057275.91..2130712.22 rows=979151 width=4)
   ->  Group  (cost=2057275.91..2106233.45 rows=9791508 width=4)
         ->  Sort  (cost=2057275.91..2081754.68 rows=9791508 width=4)
               Sort Key: spcode
               ->  Seq Scan on callticket  (cost=0.00..424310.08 rows=9791508 width=4)
(5 rows)



Sam Barnett-Cormack
2003-09-08 10:42:42 UTC
Post by Vasilis Ventirozos
This is a simple statement that I run:
core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode;
Well, yeah. Whatever you do, a complete seqscan and count is going to
take a long time, in the order of hours rather than days I would expect.
However, you may want to ensure that all tuning in postgresql.conf is
correct, as it may not be using all possible resources. That will
probably only make a small difference.
--
Sam Barnett-Cormack
Software Developer | Student of Physics & Maths
UK Mirror Service (http://www.mirror.ac.uk) | Lancaster University

Gaetano Mendola
2003-09-08 10:54:21 UTC
Post by Vasilis Ventirozos
This is a simple statement that I run:
core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode;
QUERY PLAN
---------------------------------------------------------------------------------------
 Aggregate  (cost=2057275.91..2130712.22 rows=979151 width=4)
   ->  Group  (cost=2057275.91..2106233.45 rows=9791508 width=4)
         ->  Sort  (cost=2057275.91..2081754.68 rows=9791508 width=4)
               Sort Key: spcode
               ->  Seq Scan on callticket  (cost=0.00..424310.08 rows=9791508 width=4)
(5 rows)
May we see your configuration file ?

Regards
Gaetano Mendola



Vasilis Ventirozos
2003-09-08 11:18:38 UTC
I use the default configuration file with TCP/IP enabled.
Any suggestions about the configuration file?

Sam Barnett-Cormack
2003-09-08 11:31:25 UTC
Post by Vasilis Ventirozos
I use the default configuration file with TCP/IP enabled.
Any suggestions about the configuration file?
Post a copy of it to the list, along with the specs of the machine it is
running on, and I'm sure we'll all pipe in.
--
Sam Barnett-Cormack
Software Developer | Student of Physics & Maths
UK Mirror Service (http://www.mirror.ac.uk) | Lancaster University

Vasilis Ventirozos
2003-09-08 11:31:30 UTC
The server is a dual Xeon 2.4 GHz HP with a 15k RPM SCSI disk and 2 GB of RAM.

# Connection Parameters
#
tcpip_socket = true
#ssl = false

#max_connections = 32
#superuser_reserved_connections = 2

port = 5432
#hostname_lookup = false
#show_source_port = false

#unix_socket_directory = ''
#unix_socket_group = ''
#unix_socket_permissions = 0777 # octal

#virtual_host = ''

#krb_server_keyfile = ''


#
# Shared Memory Size
#
#shared_buffers = 64 # min max_connections*2 or 16, 8KB each
#max_fsm_relations = 1000 # min 10, fsm is free space map, ~40 bytes
#max_fsm_pages = 10000 # min 1000, fsm is free space map, ~6 bytes
#max_locks_per_transaction = 64 # min 10
#wal_buffers = 8 # min 4, typically 8KB each

#
# Non-shared Memory Sizes
#
#sort_mem = 1024 # min 64, size in KB
#vacuum_mem = 8192 # min 1024, size in KB

#
# Write-ahead log (WAL)
#
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300 # range 30-3600, in seconds
#
#commit_delay = 0 # range 0-100000, in microseconds
#commit_siblings = 5 # range 1-1000
#
#fsync = true
#wal_sync_method = fsync # the default varies across platforms:
#                        # fsync, fdatasync, open_sync, or open_datasync
#wal_debug = 0 # range 0-16


#
# Optimizer Parameters
#
#enable_seqscan = true
#enable_indexscan = true
#enable_tidscan = true
#enable_sort = true
#enable_nestloop = true
#enable_mergejoin = true
#enable_hashjoin = true

#effective_cache_size = 1000 # typically 8KB each
#random_page_cost = 4 # units are one sequential page fetch cost
#cpu_tuple_cost = 0.01 # (same)
#cpu_index_tuple_cost = 0.001 # (same)
#cpu_operator_cost = 0.0025 # (same)

#default_statistics_target = 10 # range 1-1000
# GEQO Optimizer Parameters
#
#geqo = true
#geqo_selection_bias = 2.0 # range 1.5-2.0
#geqo_threshold = 11
#geqo_pool_size = 0 # default based on tables in statement,
# range 128-1024
#geqo_effort = 1
#geqo_generations = 0
#geqo_random_seed = -1 # auto-compute seed


#
# Message display
#
#server_min_messages = notice # Values, in order of decreasing detail:
# debug5, debug4, debug3, debug2, debug1,
# info, notice, warning, error, log, fatal,
# panic
#client_min_messages = notice # Values, in order of decreasing detail:
# debug5, debug4, debug3, debug2, debug1,
# log, info, notice, warning, error
#silent_mode = false

log_connections = true
log_pid = true
log_statement = true
log_duration = true
log_timestamp = true

#log_min_error_statement = panic # Values in order of increasing severity:
# debug5, debug4, debug3, debug2, debug1,
# info, notice, warning, error, panic(off)

#debug_print_parse = false
#debug_print_rewritten = false
#debug_print_plan = false
#debug_pretty_print = false

#explain_pretty_print = true

# requires USE_ASSERT_CHECKING
#debug_assertions = true


#
# Syslog
#
#syslog = 0 # range 0-2
#syslog_facility = 'LOCAL0'
#syslog_ident = 'postgres'


#
# Statistics
#
#show_parser_stats = false
#show_planner_stats = false
#show_executor_stats = false
#show_statement_stats = false

# requires BTREE_BUILD_STATS
#show_btree_build_stats = false


#
# Access statistics collection
#
#stats_start_collector = true
#stats_reset_on_server_start = true

#stats_command_string = false
#stats_row_level = false
#stats_block_level = false


#
# Lock Tracing
#
#trace_notify = false

# requires LOCK_DEBUG
#trace_locks = false
#trace_userlocks = false
#trace_lwlocks = false
#debug_deadlocks = false
#trace_lock_oidmin = 16384
#trace_lock_table = 0


#
# Misc
#
#autocommit = true
#dynamic_library_path = '$libdir'
#search_path = '$user,public'
#datestyle = 'iso, us'
#timezone = unknown # actually, defaults to TZ environment setting
#australian_timezones = false
#client_encoding = sql_ascii # actually, defaults to database encoding
#authentication_timeout = 60 # 1-600, in seconds
#deadlock_timeout = 1000 # in milliseconds
#default_transaction_isolation = 'read committed'
#max_expr_depth = 10000 # min 10
#max_files_per_process = 1000 # min 25
#password_encryption = true
#sql_inheritance = true
#transform_null_equals = false
#statement_timeout = 0 # 0 is disabled, in milliseconds
#db_user_namespace = false



#
# Locale settings
#
# (initialized by initdb -- may be changed)
LC_MESSAGES = 'C'
LC_MONETARY = 'C'
LC_NUMERIC = 'C'
LC_TIME = 'C'



Sam Barnett-Cormack
2003-09-08 11:43:03 UTC
On Mon, 8 Sep 2003, Vasilis Ventirozos wrote:

This bit is simple and probably wants leaving alone, except for very
specific changes as you need them
Post by Vasilis Ventirozos
# Connection Parameters
#
tcpip_socket = true
#ssl = false
#max_connections = 32
#superuser_reserved_connections = 2
port = 5432
#hostname_lookup = false
#show_source_port = false
#unix_socket_directory = ''
#unix_socket_group = ''
#unix_socket_permissions = 0777 # octal
#virtual_host = ''
#krb_server_keyfile = ''
#
# Shared Memory Size
#
#shared_buffers = 64 # min max_connections*2 or 16, 8KB each
#max_fsm_relations = 1000 # min 10, fsm is free space map, ~40 bytes
#max_fsm_pages = 10000 # min 1000, fsm is free space map, ~6 bytes
#max_locks_per_transaction = 64 # min 10
#wal_buffers = 8 # min 4, typically 8KB each
These probably want upping, quite a lot for your system

max_fsm_relations at least 5000, I'd say, max_fsm_pages 50000,
shared_buffers can be made to be huge
Post by Vasilis Ventirozos
#
# Non-shared Memory Sizes
#
#sort_mem = 1024 # min 64, size in KB
#vacuum_mem = 8192 # min 1024, size in KB
# Write-ahead log (WAL)
These want upping, quite a lot. Like:

sort_mem = 65536
vacuum_mem = 32768

Are what I use, and I have a much lower-power box
Post by Vasilis Ventirozos
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300 # range 30-3600, in seconds
#
#commit_delay = 0 # range 0-100000, in microseconds
#commit_siblings = 5 # range 1-1000
#
#fsync = true
# # fsync, fdatasync, open_sync, or open_datasync
#wal_debug = 0 # range 0-16
Less sure about that lot.

The next section is only used by the query planner; debug it if you have
silly plans being generated. You probably want to change
random_page_cost.
Post by Vasilis Ventirozos
#
# Optimizer Parameters
#
#enable_seqscan = true
#enable_indexscan = true
#enable_tidscan = true
#enable_sort = true
#enable_nestloop = true
#enable_mergejoin = true
#enable_hashjoin = true
#effective_cache_size = 1000 # typically 8KB each
#random_page_cost = 4 # units are one sequential page fetch cost
#cpu_tuple_cost = 0.01 # (same)
#cpu_index_tuple_cost = 0.001 # (same)
#cpu_operator_cost = 0.0025 # (same)
#default_statistics_target = 10 # range 1-1000
# GEQO Optimizer Parameters
#
#geqo = true
#geqo_selection_bias = 2.0 # range 1.5-2.0
#geqo_threshold = 11
#geqo_pool_size = 0 # default based on tables in statement,
# range 128-1024
#geqo_effort = 1
#geqo_generations = 0
#geqo_random_seed = -1 # auto-compute seed
A lot of system stuff now, scroll down...
Post by Vasilis Ventirozos
#
# Message display
#
# debug5, debug4, debug3, debug2, debug1,
# info, notice, warning, error, log, fatal,
# panic
# debug5, debug4, debug3, debug2, debug1,
# log, info, notice, warning, error
#silent_mode = false
log_connections = true
log_pid = true
log_statement = true
log_duration = true
log_timestamp = true
# debug5, debug4, debug3, debug2, debug1,
# info, notice, warning, error, panic(off)
#debug_print_parse = false
#debug_print_rewritten = false
#debug_print_plan = false
#debug_pretty_print = false
#explain_pretty_print = true
# requires USE_ASSERT_CHECKING
#debug_assertions = true
#
# Syslog
#
#syslog = 0 # range 0-2
#syslog_facility = 'LOCAL0'
#syslog_ident = 'postgres'
#
# Statistics
#
#show_parser_stats = false
#show_planner_stats = false
#show_executor_stats = false
#show_statement_stats = false
# requires BTREE_BUILD_STATS
#show_btree_build_stats = false
#
# Access statistics collection
#
#stats_start_collector = true
#stats_reset_on_server_start = true
#stats_command_string = false
#stats_row_level = false
#stats_block_level = false
#
# Lock Tracing
#
#trace_notify = false
# requires LOCK_DEBUG
#trace_locks = false
#trace_userlocks = false
#trace_lwlocks = false
#debug_deadlocks = false
#trace_lock_oidmin = 16384
#trace_lock_table = 0
Some useful stuff in Misc

deadlock timeout probably wants increasing, for safety's sake. Other
things can be played with.
Post by Vasilis Ventirozos
#
# Misc
#
#autocommit = true
#dynamic_library_path = '$libdir'
#search_path = '$user,public'
#datestyle = 'iso, us'
#timezone = unknown # actually, defaults to TZ environment setting
#australian_timezones = false
#client_encoding = sql_ascii # actually, defaults to database encoding
#authentication_timeout = 60 # 1-600, in seconds
#deadlock_timeout = 1000 # in milliseconds
#default_transaction_isolation = 'read committed'
#max_expr_depth = 10000 # min 10
#max_files_per_process = 1000 # min 25
#password_encryption = true
#sql_inheritance = true
#transform_null_equals = false
#statement_timeout = 0 # 0 is disabled, in milliseconds
#db_user_namespace = false
#
# Locale settings
#
# (initialized by initdb -- may be changed)
LC_MESSAGES = 'C'
LC_MONETARY = 'C'
LC_NUMERIC = 'C'
LC_TIME = 'C'
Hope some of that helps.
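
Putting those suggestions together, the non-default lines in postgresql.conf
for a 2 GB dual-Xeon box might look something like the sketch below. The exact
numbers are only guesses/starting points and want benchmarking against the
real workload:

    shared_buffers = 16384          # 16384 x 8KB = 128MB; "can be made huge", but measure first
    max_fsm_relations = 5000
    max_fsm_pages = 50000
    sort_mem = 65536                # 64MB per sort - watch out if many sorts run concurrently
    vacuum_mem = 32768
    effective_cache_size = 131072   # ~1GB of OS file cache, in 8KB pages (a guess for this box)
    random_page_cost = 3            # lower than the default 4 if most of the data stays cached
    deadlock_timeout = 5000         # in milliseconds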
--
Sam Barnett-Cormack
Software Developer | Student of Physics & Maths
UK Mirror Service (http://www.mirror.ac.uk) | Lancaster University

Bruno Wolff III
2003-09-08 13:21:08 UTC
On Mon, Sep 08, 2003 at 13:26:05 +0300,
Post by Vasilis Ventirozos
This is a simple statement that I run:
core_netfon=# EXPLAIN select spcode,count(*) from callticket group by spcode;
QUERY PLAN
---------------------------------------------------------------------------------------
 Aggregate  (cost=2057275.91..2130712.22 rows=979151 width=4)
   ->  Group  (cost=2057275.91..2106233.45 rows=9791508 width=4)
         ->  Sort  (cost=2057275.91..2081754.68 rows=9791508 width=4)
               Sort Key: spcode
               ->  Seq Scan on callticket  (cost=0.00..424310.08 rows=9791508 width=4)
(5 rows)
In addition to making the changes to the config file as suggested in other
responses, you may also want to do some testing with the 7.4 beta.
Hash aggregates will most likely speed this query up a lot (assuming there
aren't millions of unique spcodes). The production release of 7.4 will
probably happen in about a month.
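
A quick way to check this on a 7.4 beta test box (a sketch only; the plan
shape is what matters):

    SET sort_mem = 65536;   -- in 7.4 a hashed aggregate is chosen only if its hash table fits in sort_mem
    EXPLAIN ANALYZE
    SELECT spcode, count(*) FROM callticket GROUP BY spcode;
    -- a plan starting with "HashAggregate" means the 9.7M-row sort has been avoided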

Tom Lane
2003-09-08 15:14:00 UTC
Post by Donald Fraser
My analysis at the time was that for random record access, performance
deteriorated the further the records being accessed were from the
beginning of the index. For example, a query with, say, OFFSET 250000
would cause large delays.
Well, yeah. OFFSET implies generating and discarding that number of
records. AFAICS there isn't any shortcut for this, even in a query
that's just an indexscan, since the index alone can't tell us whether
any given record would actually be returned.

regards, tom lane
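
For walking a big result set in chunks without paying that cost again for
each chunk, one alternative to the relative-key queries described above is a
cursor held open inside a transaction - a sketch, using the table from this
thread:

    BEGIN;
    DECLARE callticket_cur CURSOR FOR
        SELECT spcode FROM callticket ORDER BY spcode;
    FETCH 10000 FROM callticket_cur;   -- first chunk
    FETCH 10000 FROM callticket_cur;   -- continues where the previous FETCH stopped
    CLOSE callticket_cur;
    COMMIT;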

Bruno Wolff III
2003-09-08 13:15:50 UTC
On Mon, Sep 08, 2003 at 10:32:51 +0300,
Post by Vasilis Ventirozos
Hi all, I work in a telco and I have a huge amount of data (50 million rows),
but I see a lack of performance on huge tables with Postgres.
Are 50 million rows the "limit" of Postgres (with good performance)?
I am expecting 2 billion records in 2004, so I have to do something.
Does anyone have a huge database that I could ask about some issues?
My hardware is good and my indexes are good, so please don't just answer
"use vacuum" :)
Well, if you want specific answers, you need to give us some specific data -
for example, EXPLAIN ANALYZE output for the slow queries and the relevant
table definitions.
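
For example, from psql (EXPLAIN ANALYZE reports actual run times as well as
the planner's estimates, and \d shows the columns and the indexes defined on
the table):

    core_netfon=# EXPLAIN ANALYZE
    core_netfon-# SELECT spcode, count(*) FROM callticket GROUP BY spcode;
    core_netfon=# \d callticket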

Ron Mayer
2003-09-10 00:49:41 UTC
Post by Vasilis Ventirozos
Hi all, I work in a telco and I have a huge amount of data (50 million rows),
but I see a lack of performance on huge tables with Postgres.
Are 50 million rows the "limit" of Postgres (with good performance)?
I have worked on a data warehouse (PostgreSQL 7.3) with a
pretty standard star schema with over 250 million rows in
the central 'fact' table, and anywhere from 100 to 10+ million
records in the surrounding 'dimension' tables.

The most common queries were simple joins between 3 tables, with
selects on one of the ids. These took a few (1-60) seconds.
About 500,000 new records were loaded each night; and the ETL
processing and creating some aggregates took about 11 hours/night
with 7.3, and 9 hours/night with 7.4beta.
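
The typical query shape was roughly the following (table and column names are
invented here, since the real schema was not posted):

    SELECT d.name, count(*)
    FROM fact_calls f
    JOIN dim_customer d ON d.customer_id = f.customer_id
    JOIN dim_date     t ON t.date_id     = f.date_id
    WHERE f.customer_id = 12345     -- "select on one of the ids"
    GROUP BY d.name;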

Hope this helps.

