List: General Discussion
From: Brandon Phelps
Date: September 8 2011 7:04pm
Subject: Re: Query Optimization
Mihail,

I have considered this but have not yet determined how best to go about partitioning
the table.  I don't think partitioning by dst_address or src_address would help,
because most of the queries do not filter on IP address (except very specific queries
where the end-user is searching the table for history on a particular employee).

I could potentially partition the table by the day of the week the connection was
opened.  That might improve performance for a while, since it would take me from a
single 32-million-record table down to roughly 4.5 million records per partition
(32 / 7).  However, we are looking to eventually store up to 2 months worth of data
in the table, and so far that 32 million records covers only 1 month, so I estimate
another 32 million-ish before the month is out, bringing me to roughly 70 million
records total.  (It would be nice if I could store even more than 2 months, but given
my current performance dilemma I don't expect that to happen.)  This also does not
take into account that the end-user will often be pulling data for multiple days at a
time, meaning that multiple partitions in this scheme would need to be accessed anyway.

The only other logical partitioning scheme I can think of would be to partition based
on dst_port (the port the log relates to), but the majority of records are all to
port 80 (standard web traffic), so I don't think this would be all that helpful.

I have never messed with partitioning, so it is entirely possible that I am not
thinking of something.  If you have any ideas on a decent partitioning scheme based
on my criteria and queries below, please let me know.
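For reference, a date-based RANGE scheme is one way to avoid the multi-partition
problem described above, since queries filtering on open_dt would be pruned to the
matching date partitions.  This is only a minimal sketch, assuming MySQL 5.1+ and
the firewall_connections table from the queries below; the partition names and
monthly boundaries are illustrative, not from the thread:

```sql
-- Illustrative only.  Note: MySQL requires every unique key (including
-- the primary key) to contain all columns in the partitioning
-- expression, so the primary key may first need to be extended to
-- include open_dt.
ALTER TABLE firewall_connections
PARTITION BY RANGE (TO_DAYS(open_dt)) (
    PARTITION p201108 VALUES LESS THAN (TO_DAYS('2011-09-01')),
    PARTITION p201109 VALUES LESS THAN (TO_DAYS('2011-10-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- Expired data can then be dropped cheaply, partition by partition:
-- ALTER TABLE firewall_connections DROP PARTITION p201108;
```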

Thanks,
Brandon
  

On 09/08/2011 02:47 PM, Mihail Manolov wrote:
> If you're running version 5.1+ you may want to take a look at the table
> partitioning options you have.
>
> On Sep 8, 2011, at 2:27 PM, Brandon Phelps wrote:
>
>> Thanks for the reply Andy.  Unfortunately the users will be selecting varying
>> date ranges and new data is constantly coming in, so I am not sure how I could
>> archive/cache the necessary data in a way that would be any more efficient than
>> simply using the database directly.
>>
>>
>> On 09/08/2011 02:16 PM, Andrew Moore wrote:
>>> Thinking outside the query, is there any archiving that could happen to make
>>> your large tables kinder in the range scan?
>>>
>>> Andy
>>>
>>> On Thu, Sep 8, 2011 at 7:03 PM, Brandon Phelps<bphelps@stripped>  wrote:
>>>
>>>> On 09/01/2011 01:32 PM, Brandon Phelps wrote:
>>>>
>>>>> On 09/01/2011 12:47 PM, Shawn Green (MySQL) wrote:
>>>>>
>>>>>> On 9/1/2011 09:42, Brandon Phelps wrote:
>>>>>>
>>>>>>> On 09/01/2011 04:59 AM, Jochem van Dieten wrote:
>>>>>>>>> ...
>>>>>>>>> WHERE
>>>>>>>>> (open_dt >= '2011-08-30 00:00:00' OR close_dt >= '2011-08-30 00:00:00')
>>>>>>>>> AND (open_dt <= '2011-08-30 12:36:53' OR close_dt <= '2011-08-30 12:36:53')
>>>>>>>
>>>>>>>> In that case your logic here simplifies to:
>>>>>>>> WHERE
>>>>>>>> open_dt >= '2011-08-30 00:00:00'
>>>>>>>> AND
>>>>>>>> close_dt <= '2011-08-30 12:36:53'
>>>>>>>
>>>>>>>> Now add an index over open_dt and close_dt and see what happens.
>>>>>>>
>>>>>>>> Jochem
>>>>>>>
>>>>>>> Jochem,
>>>>>>>
>>>>>>> I can't really use your WHERE logic because I also need to retrieve
>>>>>>> results where the open_dt time is out of the range specified.  For
>>>>>>> example, a very large file download might span multiple days, so given
>>>>>>> your logic, if the connection was started 2 days ago and I want to pull
>>>>>>> 1 day's worth of connections, I would miss that entry.  Basically I
>>>>>>> want to SELECT all of the records that were opened OR closed during the
>>>>>>> specified time period, i.e. if any activity happened between my start
>>>>>>> and end dates, I need to see that record.
>>>>>>>
>>>>>>> Any other ideas?
>>>>>>>
>>>>>>>
>>>>>> I believe Jochem was on the right track but he got his dates reversed.
>>>>>>
>>>>>> Let's try a little ASCII art to show the situation. I will set up a
>>>>>> query window with two markers (s) and (e). Events will be marked by
>>>>>> |----| markers showing their durations.
>>>>>>
>>>>>> a)       (s)                  (e)
>>>>>> b) |---|
>>>>>> c)     |---|
>>>>>> d)            |---|
>>>>>> e)     |--------------------------|
>>>>>> f)                          |---|
>>>>>> g)                                 |---|
>>>>>>
>>>>>> To describe these situations:
>>>>>> a) is the window for which you want to query: (s) is the starting time
>>>>>> and (e) is the ending time for the date range you are interested in.
>>>>>> b) the event starts and stops before your window exists. It won't be
>>>>>> part of your results.
>>>>>> c) the event starts before the window but ends within the window -
>>>>>> include this.
>>>>>> d) the event starts and ends within the window - include this.
>>>>>> e) the event starts before the window and ends after the window -
>>>>>> include this.
>>>>>> f) the event starts inside the window but ends beyond the window -
>>>>>> include this.
>>>>>> g) the event starts and ends beyond the window - exclude this.
>>>>>>
>>>>>> In order to get every event in the range of c-f, here is what you need
>>>>>> for a WHERE clause:
>>>>>>
>>>>>> WHERE start <= (ending time) AND end >= (starting time)
>>>>>>
>>>>>> Try that and let us know the results.
>>>>>>
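Shawn's rule, written out against the column names used elsewhere in the thread,
with each of his cases annotated (a sketch; the date literals are placeholders for
the window's start and end):

```sql
-- An event [open_dt, close_dt] overlaps the window [s, e] exactly when
-- it opened no later than the window end AND closed no earlier than the
-- window start:
--   b) open_dt <= e true,  close_dt >= s false -> excluded
--   c) open_dt <= e true,  close_dt >= s true  -> included
--   d) open_dt <= e true,  close_dt >= s true  -> included
--   e) open_dt <= e true,  close_dt >= s true  -> included
--   f) open_dt <= e true,  close_dt >= s true  -> included
--   g) open_dt <= e false                      -> excluded
SELECT *
FROM firewall_connections
WHERE open_dt  <= '2011-09-01 09:53:31'   -- (e) the ending time
  AND close_dt >= '2011-08-31 09:53:31';  -- (s) the starting time
```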
>>>>>
>>>>> Thanks Jochem and Shawn, however the following two queries result in the
>>>>> exact same EXPLAIN output: (I hope the tables don't wrap too early for you)
>>>>>
>>>>> Old method:
>>>>> SELECT
>>>>>     sc.open_dt,
>>>>>     sc.close_dt,
>>>>>     sc.protocol,
>>>>>     INET_NTOA(sc.src_address) AS src_address,
>>>>>     sc.src_port,
>>>>>     INET_NTOA(sc.dst_address) AS dst_address,
>>>>>     sc.dst_port,
>>>>>     sc.sent,
>>>>>     sc.rcvd,
>>>>>     spm.desc AS src_port_desc,
>>>>>     dpm.desc AS dst_port_desc
>>>>> FROM firewall_connections AS sc
>>>>> LEFT JOIN port_mappings AS spm ON spm.port = sc.src_port
>>>>> LEFT JOIN port_mappings AS dpm ON dpm.port = sc.dst_port
>>>>> WHERE
>>>>>     (open_dt >= '2011-08-31 09:53:31' OR close_dt >= '2011-08-31 09:53:31')
>>>>>     AND (open_dt <= '2011-09-01 09:53:31' OR close_dt <= '2011-09-01 09:53:31')
>>>>> ORDER BY rcvd DESC
>>>>> LIMIT 0, 10;
>>>>>
>>>>> New method with BTREE index on (open_dt, close_dt) (index name is
>>>>> ndx_open_close_dt):
>>>>> SELECT
>>>>>     sc.open_dt,
>>>>>     sc.close_dt,
>>>>>     sc.protocol,
>>>>>     INET_NTOA(sc.src_address) AS src_address,
>>>>>     sc.src_port,
>>>>>     INET_NTOA(sc.dst_address) AS dst_address,
>>>>>     sc.dst_port,
>>>>>     sc.sent,
>>>>>     sc.rcvd,
>>>>>     spm.desc AS src_port_desc,
>>>>>     dpm.desc AS dst_port_desc
>>>>> FROM firewall_connections AS sc
>>>>> LEFT JOIN port_mappings AS spm ON spm.port = sc.src_port
>>>>> LEFT JOIN port_mappings AS dpm ON dpm.port = sc.dst_port
>>>>> WHERE
>>>>>     open_dt <= '2011-09-01 09:53:31' AND close_dt >= '2011-08-31 09:53:31'
>>>>> ORDER BY rcvd DESC
>>>>> LIMIT 0, 10;
>>>>>
>>>>> EXPLAIN output for old method:
>>>>> +----+-------------+-------+--------+---------------------------+----------+---------+--------------------+------+-------------+
>>>>> | id | select_type | table | type   | possible_keys             | key      | key_len | ref                | rows | Extra       |
>>>>> +----+-------------+-------+--------+---------------------------+----------+---------+--------------------+------+-------------+
>>>>> |  1 | SIMPLE      | sc    | index  | open_dt,ndx_open_close_dt | ndx_rcvd | 4       | NULL               |   10 | Using where |
>>>>> |  1 | SIMPLE      | spm   | eq_ref | PRIMARY                   | PRIMARY  | 2       | syslog.sc.src_port |    1 |             |
>>>>> |  1 | SIMPLE      | dpm   | eq_ref | PRIMARY                   | PRIMARY  | 2       | syslog.sc.dst_port |    1 |             |
>>>>> +----+-------------+-------+--------+---------------------------+----------+---------+--------------------+------+-------------+
>>>>>
>>>>> EXPLAIN output for new method with new index:
>>>>> +----+-------------+-------+--------+---------------------------+----------+---------+--------------------+------+-------------+
>>>>> | id | select_type | table | type   | possible_keys             | key      | key_len | ref                | rows | Extra       |
>>>>> +----+-------------+-------+--------+---------------------------+----------+---------+--------------------+------+-------------+
>>>>> |  1 | SIMPLE      | sc    | index  | open_dt,ndx_open_close_dt | ndx_rcvd | 4       | NULL               |   10 | Using where |
>>>>> |  1 | SIMPLE      | spm   | eq_ref | PRIMARY                   | PRIMARY  | 2       | syslog.sc.src_port |    1 |             |
>>>>> |  1 | SIMPLE      | dpm   | eq_ref | PRIMARY                   | PRIMARY  | 2       | syslog.sc.dst_port |    1 |             |
>>>>> +----+-------------+-------+--------+---------------------------+----------+---------+--------------------+------+-------------+
>>>>>
>>>>> SHOW INDEX:
>>>>> +----------------------+------------+-------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
>>>>> | Table                | Non_unique | Key_name          | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
>>>>> +----------------------+------------+-------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
>>>>> | firewall_connections |          1 | ndx_open_close_dt |            1 | open_dt     | A         |     1342691 | NULL     | NULL   |      | BTREE      |         |
>>>>> | firewall_connections |          1 | ndx_open_close_dt |            2 | close_dt    | A         |     6377783 | NULL     | NULL   |      | BTREE      |         |
>>>>> +----------------------+------------+-------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
>>>>>
>>>>>
>>>>> The queries do seem to be executing much faster right now, although I'm
>>>>> not quite sure why.  And I'm not sure why the new ndx_open_close_dt isn't
>>>>> being used either.
>>>>>
>>>>> -Brandon
>>>>>
>>>>>
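An aside on the EXPLAIN output above: the optimizer picks ndx_rcvd, presumably
because scanning that index in rcvd order satisfies ORDER BY rcvd DESC LIMIT 10
without a sort, which also explains why ndx_open_close_dt goes unused.  One way
to check whether the range index would actually do better is an index hint.  This
is an experiment, not something suggested in the thread; the table, column, and
index names are taken from the messages above:

```sql
-- Hypothetical experiment: force the (open_dt, close_dt) index and
-- accept a filesort for the ORDER BY, to compare timings against the
-- optimizer's default plan (ordered index scan on ndx_rcvd).
SELECT sc.open_dt, sc.close_dt, sc.rcvd
FROM firewall_connections AS sc FORCE INDEX (ndx_open_close_dt)
WHERE sc.open_dt  <= '2011-09-01 09:53:31'
  AND sc.close_dt >= '2011-08-31 09:53:31'
ORDER BY sc.rcvd DESC
LIMIT 0, 10;
```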
>>>> I am still having a big issue with my query as seen above.  The table is up
>>>> to around 32 million records at the moment and either of the two SELECT
>>>> queries above takes a very long time to run.  Is there anything at all I
>>>> can do to speed things up?  It seems that changing the format of the WHERE
>>>> clause did not help at all, as the EXPLAIN output is exactly the same for
>>>> both versions.  I also tried adding an index on (open_dt, close_dt, rcvd),
>>>> but that index does not get used.
>>>>
>>>> Any other ideas?
>>>>
>>>> Thanks in advance,
>>>>
>>>>
>>>> Brandon
>>>>
>>>> --
>>>> MySQL General Mailing List
>>>> For list archives: http://lists.mysql.com/mysql
>>>> To unsubscribe:    http://lists.mysql.com/mysql?unsub=eroomydna@stripped
>>>>
>>>>
>>>
>>
>>
>
>
Thread
Query Optimization (Brandon Phelps, 30 Aug)
  • Re: Query Optimization (Michael Dykman, 30 Aug)
  • Re: Query Optimization (Jochem van Dieten, 1 Sep)
    • Re: Query Optimization (Brandon Phelps, 1 Sep)
      • Re: Query Optimization (Shawn Green (MySQL), 1 Sep)
        • Re: Query Optimization (Brandon Phelps, 1 Sep)
          • Re: Query Optimization (Brandon Phelps, 8 Sep)
            • Re: Query Optimization (Andrew Moore, 8 Sep)
              • Re: Query Optimization (Brandon Phelps, 8 Sep)
                • Re: Query Optimization (Andrew Moore, 8 Sep)
                  • Re: Query Optimization (Brandon Phelps, 8 Sep)
                    • Re: Query Optimization (Derek Downey, 8 Sep)
                      • Re: Query Optimization (Brandon Phelps, 8 Sep)
                        • Re: Query Optimization (Mihail Manolov, 8 Sep)
                          • Re: Query Optimization (Brandon Phelps, 8 Sep)
                            • Re: Query Optimization (Mihail Manolov, 8 Sep)
                              • Re: Query Optimization (Brandon Phelps, 8 Sep)
                • Re: Query Optimization (Mihail Manolov, 8 Sep)
                  • Re: Query Optimization (Brandon Phelps, 8 Sep)
                    • Re: Query Optimization (Andrew Moore, 8 Sep)
                      • Re: Query Optimization (Brandon Phelps, 8 Sep)