List:General Discussion« Previous MessageNext Message »
From:Tim Gustafson Date:June 24 2010 6:33pm
Subject:selecting data for computation
View as plain text  
I'm working on a compute job that queries mySQL for work packets,
processes the work packet and then updates the row in mySQL with the
results of the job.

Currently, from each compute instance I'm doing something like:

select * from work_packets where work_data is null order by rand() limit 1;

That more or less works, and occasionally two jobs will process the same
row and I'll waste a little bit of CPU time, but as a higher percentage of
jobs get completed, duplicate work happens more and more.

I was looking at something like:

select * from work_packets where work_data is null order by rand() limit 1
for update;

and

select * from work_packets where work_data is null order by rand() limit 1
lock in share mode;

but those aren't exactly what I need - they don't cause the second compute
instance to move on to another row; instead they cause the second compute
job to wait for the first one to finish and -then- they re-do the same
work that the first job just completed.

I also thought about doing something like:

select * from work_packets where work_data is null and (select
get_lock(work_packets.id, 0) = 1 order by rand() limit 1;

but that seems to significantly increase the time it takes to run this
query from about 2 seconds to about 13 seconds.  Since the jobs only take
about 10 seconds of CPU time each, this represents in an actual slow-down
as opposed to a speed-up.

Is there any sort of query that I can do that will return a single row to
me that another compute job hasn't already started working on, other than
doing something like:

select * from work_packets where work_data is null and working = 0 order
by rand() limit 1 for update;
update work_packets set working = 1 where id = <id from select query>;

The reason I don't want to use this approach is that sometimes the compute
jobs die in the middle of their work, which would leave these records
without any work_data, but marked as "working", and those records would
have to be manually cleaned up at some point.

I feel like I'm missing something really basic here, but I can't seem to
find the answer.  Does anyone have any suggestions?

Tim Gustafson
831-332-1496
tjg@stripped
http://tgustafson.com/

Thread
selecting data for computationTim Gustafson24 Jun
  • Re: selecting data for computationDan Nelson24 Jun
    • Re: selecting data for computationTim Gustafson25 Jun
      • Re: selecting data for computationTim Gustafson25 Jun
        • Re: selecting data for computationDan Nelson25 Jun
          • Re: selecting data for computationTim Gustafson26 Jun
            • Re: selecting data for computationBaron Schwartz28 Jun
              • Re: selecting data for computationTim Gustafson1 Jul
      • Re: selecting data for computationJoerg Bruehe25 Jun