List: General Discussion
From: Brandon Phelps
Date: September 8, 2011 8:43pm
Subject: Re: Query Optimization
Andrew,

Generally there is only 1 user performing the complicated SELECT query at a time; however,
the background process that fills the table is constantly doing a fast SELECT (0.00003
seconds) and a subsequent UPDATE.  Basically, whenever a connection is closed on the
firewall, the bg process SELECTs from the table the last record that was opened (and
hasn't already been closed) for a given source IP, source port, dest IP, and protocol.
So for example, whenever the firewall logs a "CONNECTION CLOSED" message, the bg process
does:

SELECT id FROM sonicwall_connections WHERE src_address = w.x.y.z AND src_port = x AND
dst_address = w.x.y.z AND protocol = x ORDER BY open_dt LIMIT 0, 1

It then UPDATEs the close_dt column for the record with the selected ID.  These
SELECT/UPDATE statements run by the background process generally take 0.000x
seconds each.
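
For illustration, the follow-up statement is roughly the following; this is only a sketch,
since the actual UPDATE isn't quoted here and the placeholders are assumptions:

UPDATE sonicwall_connections
SET close_dt = '<timestamp from the CONNECTION CLOSED log message>'
WHERE id = <id returned by the SELECT above>;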

The only slowdown we see is with the complicated SELECT I have below, and this is only
executed once in a while throughout the day.  For example, if an IT manager sees a huge
bandwidth spike on the network, he may access the web gui to determine what the source IP
is so he can determine who is downloading a large file, etc.

I think what I will try to do is create 60 partitions for the table in question based on
month and day.  This way each day will go into its own partition, and if someone runs the
query for a date range such as 01/01/2011 - 01/07/2011 it will only need to access 7
partitions instead of the entire table.

My only question with this is how I would go about creating the table with 60 partitions
in such a way that I won't need to constantly add/remove new/old partitions every day.
Is there any partitioning function I can use that would not require me to update the
partition schema every day?  I already plan to have a cron run each night to purge
records older than 60 days from the database.
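
Something along these lines is what I'm picturing, if a layout like this is even possible.
This is only a rough sketch (the column names/types are guesses based on the queries
below), but hashing TO_DAYS(open_dt) into 60 buckets would map each day to a fixed
partition without any nightly ALTERs.  Note that MySQL requires the partitioning column to
be part of every unique key, hence the composite primary key:

CREATE TABLE sonicwall_connections (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
    open_dt     DATETIME NOT NULL,
    close_dt    DATETIME NULL,
    protocol    SMALLINT UNSIGNED,
    src_address INT UNSIGNED,
    src_port    SMALLINT UNSIGNED,
    dst_address INT UNSIGNED,
    dst_port    SMALLINT UNSIGNED,
    sent        BIGINT UNSIGNED,
    rcvd        BIGINT UNSIGNED,
    PRIMARY KEY (id, open_dt),
    KEY ndx_open_close_dt (open_dt, close_dt)
)
PARTITION BY HASH (TO_DAYS(open_dt))
PARTITIONS 60;

Since TO_DAYS(open_dt) MOD 60 simply cycles through the same 60 buckets, the nightly purge
of rows older than 60 days would keep each bucket at roughly one day's worth of data, and
no partitions would ever need to be added or dropped by hand.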

On 09/08/2011 03:26 PM, Andrew Moore wrote:
> Partitioning isn't a bad idea for this, however I'm still thinking about your
> dataset size and possible hardware limitations. It's not likely going to fit
> into relevant buffers/memory, so you're going to be on disk more than you
> want. You're probably creating temporary tables like crazy and I would bet
> that a good load of them are heading to disk too. With your IO
> performance limited to a small number of disks as you describe, you're not
> going to be able to get much more from these queries. Although it is a dedicated
> DB server, are there other databases being accessed on the server? When
> looking at the scope of your data, are you capturing more than you need? How
> often and how far back are the users querying? How many users are concurrently
> performing queries on the 32m record table?
>
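
(As an aside, one way to check whether those implicit temporary tables are in fact
spilling to disk is to compare the tmp-table counters; a quick sketch, using the standard
status variables:)

SHOW GLOBAL STATUS LIKE 'Created_tmp%';
-- Comparing Created_tmp_disk_tables against Created_tmp_tables shows how many
-- of the temporary tables created for queries ended up on disk.
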
> On Thu, Sep 8, 2011 at 8:04 PM, Brandon Phelps<bphelps@stripped>  wrote:
>
>> Mihail,
>>
>> I have considered this but have not yet determined how best to go about
>> partitioning the table.  I don't think partitioning by dst_address or
>> src_address would help because most of the queries do not filter on IP
>> address (except very specific queries where the end-user is searching the
>> table for history on a particular employee).
>>
>> I could potentially partition the table based on the day of week the
>> connection was opened on, which may improve performance for a while since
>> this would take me from a single 32 million record table down to roughly 4.5
>> million records per partition (32 / 7); however, we are looking to eventually
>> store up to 2 months worth of data in the table, and so far that 32 million
>> records is only for 1 month, so I estimate another 32 million-ish before the
>> month is out, bringing me to roughly 70 million records total (it would be
>> nice if I could store even more than 2 months, but given my current
>> performance dilemma I don't expect that to happen).  Also this does not take
>> into account that the end-user will often be pulling data for multiple days
>> at a time, meaning that multiple partitions in this scheme will need to be
>> accessed anyway.
>>
>> The only other logical partitioning scheme I can think of would be to
>> partition based on dst_port (the port the log relates to) but the majority
>> of records are all to port 80 (standard web traffic) so I don't think this
>> would be all that helpful.
>>
>> I have never messed with partitioning so it is totally possible that I am
>> not thinking of something, so if you have any ideas on a decent partitioning
>> scheme based on my criteria and queries below, please let me know.
>>
>> Thanks,
>> Brandon
>>
>>
>> On 09/08/2011 02:47 PM, Mihail Manolov wrote:
>>
>>> If you're running version 5.1+ you may wanna take a look at table
>>> partitioning options you may have.
>>>
>>> On Sep 8, 2011, at 2:27 PM, Brandon Phelps wrote:
>>>
>>>> Thanks for the reply Andy.  Unfortunately the users will be selecting
>>>> varying date ranges and new data is constantly coming in, so I am not sure
>>>> how I could archive/cache the necessary data in a way that would be any more
>>>> efficient than simply using the database directly.
>>>>
>>>>
>>>> On 09/08/2011 02:16 PM, Andrew Moore wrote:
>>>>
>>>>> Thinking outside the query, is there any archiving that could happen to
>>>>> make your large tables kinder in the range scan?
>>>>>
>>>>> Andy
>>>>>
>>>>> On Thu, Sep 8, 2011 at 7:03 PM, Brandon Phelps<bphelps@stripped>
>>>>> wrote:
>>>>>
>>>>>   On 09/01/2011 01:32 PM, Brandon Phelps wrote:
>>>>>>
>>>>>>   On 09/01/2011 12:47 PM, Shawn Green (MySQL) wrote:
>>>>>>>
>>>>>>>   On 9/1/2011 09:42, Brandon Phelps wrote:
>>>>>>>>
>>>>>>>>   On 09/01/2011 04:59 AM, Jochem van Dieten wrote:
>>>>>>>>>
>>>>>>>>>> ...
>>>>>>>>>>> WHERE
>>>>>>>>>>> (open_dt >= '2011-08-30 00:00:00' OR close_dt >= '2011-08-30 00:00:00')
>>>>>>>>>>> AND (open_dt <= '2011-08-30 12:36:53' OR close_dt <= '2011-08-30 12:36:53')
>>>>>>>>>>
>>>>>>>>>> In that case your logic here simplifies to:
>>>>>>>>>> WHERE
>>>>>>>>>> open_dt >= '2011-08-30 00:00:00'
>>>>>>>>>> AND
>>>>>>>>>> close_dt <= '2011-08-30 12:36:53'
>>>>>>>>>>
>>>>>>>>>> Now add an index over open_dt and close_dt and see what happens.
>>>>>>>>>>
>>>>>>>>>> Jochem
>>>>>>>>> Jochem,
>>>>>>>>>
>>>>>>>>> I can't really use your WHERE logic because I also need to retrieve
>>>>>>>>> results where the open_dt time is out of the range specified.  For
>>>>>>>>> example, a very large file download might span multiple days, so given
>>>>>>>>> your logic, if the connection was started 2 days ago and I want to pull
>>>>>>>>> 1 day's worth of connections, I would miss that entry.  Basically I want
>>>>>>>>> to SELECT all of the records that were opened OR closed during the
>>>>>>>>> specified time period, i.e. if any activity happened between my start
>>>>>>>>> and end dates, I need to see that record.
>>>>>>>>>
>>>>>>>>> Any other ideas?
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I believe Jochem was on the right track but he got his dates reversed.
>>>>>>>>
>>>>>>>> Let's try a little ASCII art to show the situation. I will setup a query
>>>>>>>> window with two markers (s) and (e). Events will be marked by |----|
>>>>>>>> markers showing their durations.
>>>>>>>>
>>>>>>>> a)        (s)                 (e)
>>>>>>>> b) |----|
>>>>>>>> c)     |----|
>>>>>>>> d)             |----|
>>>>>>>> e)     |----------------------------|
>>>>>>>> f)                           |-------|
>>>>>>>> g)                                    |----|
>>>>>>>>
>>>>>>>> To describe these situations:
>>>>>>>> a) is the window for which you want to query: (s) is the starting time
>>>>>>>> and (e) is the ending time for the date range you are interested in.
>>>>>>>> b) the event starts and stops before your window exists. It won't be
>>>>>>>> part of your results.
>>>>>>>> c) the event starts before the window but ends within the window -
>>>>>>>> include this
>>>>>>>> d) the event starts and ends within the window - include this
>>>>>>>> e) the event starts before the window and ends after the window -
>>>>>>>> include this
>>>>>>>> f) the event starts inside the window but ends beyond the window -
>>>>>>>> include this.
>>>>>>>> g) the event starts and ends beyond the window - exclude this.
>>>>>>>>
>>>>>>>> In order to get every event in the range of c-f, here is what you need
>>>>>>>> for a WHERE clause:
>>>>>>>>
>>>>>>>> WHERE start <= (ending time) AND end >= (starting time)
>>>>>>>>
>>>>>>>> Try that and let us know the results.
>>>>>>>>
>>>>>>>>
>>>>>>> Thanks Jochem and Shawn, however the following two queries result in the
>>>>>>> exact same EXPLAIN output (I hope the tables don't wrap too early for
>>>>>>> you):
>>>>>>>
>>>>>>> Old method:
>>>>>>> SELECT
>>>>>>> sc.open_dt,
>>>>>>> sc.close_dt,
>>>>>>> sc.protocol,
>>>>>>> INET_NTOA(sc.src_address) AS src_address,
>>>>>>> sc.src_port,
>>>>>>> INET_NTOA(sc.dst_address) AS dst_address,
>>>>>>> sc.dst_port,
>>>>>>> sc.sent,
>>>>>>> sc.rcvd,
>>>>>>> spm.desc AS src_port_desc,
>>>>>>> dpm.desc AS dst_port_desc
>>>>>>> FROM firewall_connections AS sc
>>>>>>> LEFT JOIN port_mappings AS spm ON spm.port = sc.src_port
>>>>>>> LEFT JOIN port_mappings AS dpm ON dpm.port = sc.dst_port
>>>>>>> WHERE
>>>>>>> (open_dt >= '2011-08-31 09:53:31' OR close_dt >= '2011-08-31 09:53:31')
>>>>>>> AND (open_dt <= '2011-09-01 09:53:31' OR close_dt <= '2011-09-01 09:53:31')
>>>>>>> ORDER BY rcvd DESC
>>>>>>> LIMIT 0, 10;
>>>>>>>
>>>>>>> New method with BTREE index on open_dt, close_dt (index name is
>>>>>>> ndx_open_close_dt):
>>>>>>> SELECT
>>>>>>> sc.open_dt,
>>>>>>> sc.close_dt,
>>>>>>> sc.protocol,
>>>>>>> INET_NTOA(sc.src_address) AS src_address,
>>>>>>> sc.src_port,
>>>>>>> INET_NTOA(sc.dst_address) AS dst_address,
>>>>>>> sc.dst_port,
>>>>>>> sc.sent,
>>>>>>> sc.rcvd,
>>>>>>> spm.desc AS src_port_desc,
>>>>>>> dpm.desc AS dst_port_desc
>>>>>>> FROM firewall_connections AS sc
>>>>>>> LEFT JOIN port_mappings AS spm ON spm.port = sc.src_port
>>>>>>> LEFT JOIN port_mappings AS dpm ON dpm.port = sc.dst_port
>>>>>>> WHERE
>>>>>>> open_dt <= '2011-09-01 09:53:31' AND close_dt >= '2011-08-31 09:53:31'
>>>>>>> ORDER BY rcvd DESC
>>>>>>> LIMIT 0, 10;
>>>>>>>
>>>>>>> EXPLAIN output for old method:
>>>>>>>
> +----+-------------+-------+--****------+---------------------**--**
>>>>>>>
> ----+----------+---------+----****----------------+------+----**
>>>>>>> --**-------+
>>>>>>> | id | select_type | table | type | possible_keys | key |
> key_len |
>>>>>>> ref |
>>>>>>> rows | Extra |
>>>>>>>
> +----+-------------+-------+--****------+---------------------**--**
>>>>>>>
> ----+----------+---------+----****----------------+------+----**
>>>>>>> --**-------+
>>>>>>> | 1 | SIMPLE | sc | index | open_dt,ndx_open_close_dt |
> ndx_rcvd | 4 |
>>>>>>> NULL | 10 | Using where |
>>>>>>> | 1 | SIMPLE | spm | eq_ref | PRIMARY | PRIMARY | 2 |
>>>>>>> syslog.sc.src_port |
>>>>>>> 1 | |
>>>>>>> | 1 | SIMPLE | dpm | eq_ref | PRIMARY | PRIMARY | 2 |
>>>>>>> syslog.sc.dst_port |
>>>>>>> 1 | |
>>>>>>>
> +----+-------------+-------+--****------+---------------------**--**
>>>>>>>
> ----+----------+---------+----****----------------+------+----**
>>>>>>> --**-------+
>>>>>>>
>>>>>>> EXPLAIN output for new method with new index:
>>>>>>>
> +----+-------------+-------+--****------+---------------------**--**
>>>>>>>
> ----+----------+---------+----****----------------+------+----**
>>>>>>> --**-------+
>>>>>>> | id | select_type | table | type | possible_keys | key |
> key_len |
>>>>>>> ref |
>>>>>>> rows | Extra |
>>>>>>>
> +----+-------------+-------+--****------+---------------------**--**
>>>>>>>
> ----+----------+---------+----****----------------+------+----**
>>>>>>> --**-------+
>>>>>>> | 1 | SIMPLE | sc | index | open_dt,ndx_open_close_dt |
> ndx_rcvd | 4 |
>>>>>>> NULL | 10 | Using where |
>>>>>>> | 1 | SIMPLE | spm | eq_ref | PRIMARY | PRIMARY | 2 |
>>>>>>> syslog.sc.src_port |
>>>>>>> 1 | |
>>>>>>> | 1 | SIMPLE | dpm | eq_ref | PRIMARY | PRIMARY | 2 |
>>>>>>> syslog.sc.dst_port |
>>>>>>> 1 | |
>>>>>>>
> +----+-------------+-------+--****------+---------------------**--**
>>>>>>>
> ----+----------+---------+----****----------------+------+----**
>>>>>>> --**-------+
>>>>>>>
>>>>>>> SHOW INDEX:
>>>>>>>
> +----------------------+------****------+-------------------+-**--**
>>>>>>>
> -----------+-------------+----****-------+-------------+------**--**
>>>>>>> --+--------+------+-----------****-+---------+
>>>>>>> | Table | Non_unique | Key_name | Seq_in_index | Column_name
> |
>>>>>>> Collation |
>>>>>>> Cardinality | Sub_part | Packed | Null | Index_type | Comment
> |
>>>>>>>
> +----------------------+------****------+-------------------+-**--**
>>>>>>>
> -----------+-------------+----****-------+-------------+------**--**
>>>>>>> --+--------+------+-----------****-+---------+
>>>>>>> | firewall_connections | 1 | ndx_open_close_dt | 1 | open_dt
> | A |
>>>>>>> 1342691
>>>>>>> | NULL | NULL | | BTREE | |
>>>>>>> | firewall_connections | 1 | ndx_open_close_dt | 2 | close_dt
> | A |
>>>>>>> 6377783 | NULL | NULL | | BTREE | |
>>>>>>>
> +----------------------+------****------+-------------------+-**--**
>>>>>>>
> -----------+-------------+----****-------+-------------+------**--**
>>>>>>> --+--------+------+-----------****-+---------+
>>>>>>>
>>>>>>>
>>>>>>> Right now the queries do seem to be executing much faster, although I'm
>>>>>>> not quite sure why.  And I'm not sure why the new ndx_open_close_dt isn't
>>>>>>> being used either.
>>>>>>>
>>>>>>> -Brandon
>>>>>>>
>>>>>>>
>>>>>> I am still having a big issue with my query as seen above.  The table is
>>>>>> up to around 32 million records at the moment and either of the two SELECT
>>>>>> queries above takes a very long time to run.  Is there anything at all I
>>>>>> can do to speed things up?  It seems that changing the format of the WHERE
>>>>>> clause did not help at all, as the EXPLAIN output is exactly the same for
>>>>>> both versions.  I also tried adding an index on (open_dt, close_dt, rcvd)
>>>>>> but that index does not get used.
>>>>>>
>>>>>> Any other ideas?
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>>
>>>>>> Brandon
>>>>>>
>>>>>
>>>
>>>
>