>>>>> "clewis" == clewis <clewis@stripped> writes:
clewis> We have some code which will run for a while before being unable
clewis> to connect to a remote server. When the client is unable to
clewis> connect, mysql_error() returns an empty string.
clewis> The code wants to retrieve information about all users. Instead
clewis> of retrieving all information, we iterate over all users by
clewis> connecting to the database, retrieving a single row, then
clewis> disconnecting. The attached code is a very simplified version
clewis> of this, written for testing purposes. For testing purposes,
clewis> I added a section to re-try the connection after a 5 second
clewis> sleep if the mysql_connect() fails. We are rewriting our
clewis> code to work around this (see Fix section), but I am concerned
clewis> about the root cause. I believe this is a problem in the client,
clewis> although I can't prove it. If it is the client, then I can work
clewis> around it with no worries. If the problem is in the server, then
clewis> our server will be having more problems in the near future as
clewis> the load increases.
clewis> The setup: We have two development machines, named hermes and
clewis> matrix. Matrix is the big machine, a PII/350, 128 Meg Ram running
clewis> RedHat 5.2. Hermes is an AMD-K6/350, 32 Meg Ram running RedHat 6.0.
clewis> Matrix has MySQL 3.21.33c client/server. Hermes has MySQL 3.22.27
clewis> client/server. Matrix is the build machine, and all code
clewis> that I compiled was built on matrix. Both machines are on a
clewis> 100baseT subnet. We originally saw the problem on the beefy
clewis> servers (Source distribution), but everything was successfully
clewis> reproduced on the smaller development machines. The web server
clewis> is a Dual PII/450 with 512 Meg RAM and the MySQL server machine
clewis> is a Quad XeonIII/450 with 2 Gig RAM.
clewis> Testing configurations:
clewis> The C code was executed in 4 different ways.
clewis> 1) Running on matrix, connecting to MySQL on matrix.
clewis> 2) Running on matrix, connecting to MySQL on hermes.
clewis> 3) Running on hermes, connecting to MySQL on hermes.
clewis> 4) Running on hermes, connecting to MySQL on matrix.
clewis> The first 3 cases work fine. The 4th case always results in a
clewis> "Can't connect to server" error. The error usually occurs in the
clewis> 4000 +/- 100th connection attempt, but not always. Most of the
clewis> time, the connection would be re-attempted after a sleep, and the
clewis> re-attempt would fail. Occasionally the re-attempt would work, but
clewis> on the next iteration both the 1st & 2nd attempt would fail. Rarely,
clewis> we would see a string of 1st connections failing, but 2nd attempts
clewis> going through, with both attempts eventually failing (This is
clewis> the example that I've provided below). I've never seen test
clewis> case 4 complete all 15000 connections successfully. We tried
clewis> upgrading MySQL on both machines to 3.22.27, but it was worse.
clewis> With both machines running 3.22.27 (RPM Distribution), we were
clewis> only able to get 1023 repeated connections before we couldn't
clewis> connect, with the 2nd attempt always failing. I can't remember
clewis> if this happened in all four test cases, or just test cases
clewis> 2 and 4.
The problem is that Linux has a delay between the moment you close a
TCP/IP socket and the moment the socket is actually freed. As there is
only room for a finite number of TCP/IP slots, you will hit the problem
after a while.
I have posted about this problem a couple of times on different
mailing lists, and I have even talked with Alan Cox, but I have never
been able to resolve it properly.
Note that Linux 2.0 doesn't have this problem; I have only seen this
with 2.2 kernels!
Here is a simple program you can use to check this:
After compiling it, you should run it as follows:
./server_client 'your host name'
If you have a problem with the TCP/IP connections, it should stop
after about 2000-4000 connections.
I just tested this on my Linux 2.2.12 kernel and it WORKED!
(It didn't, when I last tested it a couple of months ago with an
earlier 2.2 kernel)
In other words, this is a server problem that may be fixed by
upgrading to a newer kernel!
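
A minimal sketch of the kind of connect/close check described above might
look like the following. It is not the server_client program referred to
in this message. The helper names open_listener and count_connects, the
use of the loopback address instead of a host name argument, the backlog
of 16, and the CONNECT_LOOP_MAIN guard are all assumptions made for
illustration. It opens a listening socket, then repeatedly connects to it
and closes both ends, counting how many rounds complete before the first
failure.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: listen on 127.0.0.1 with a kernel-chosen port.
   Returns the listening fd (or -1) and stores the port in *port. */
int open_listener(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

/* Hypothetical helper: connect to 127.0.0.1:port, accept the connection,
   close both ends, and repeat. Returns the number of rounds completed
   before the first failure (equal to attempts if none failed). */
int count_connects(int listener, unsigned short port, int attempts)
{
    struct sockaddr_in addr;
    int done;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    for (done = 0; done < attempts; done++) {
        int srv;
        int cli = socket(AF_INET, SOCK_STREAM, 0);

        if (cli < 0)
            break;
        if (connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            close(cli);
            break;
        }
        srv = accept(listener, NULL, NULL);
        close(cli); /* the client side closes first */
        if (srv < 0)
            break;
        close(srv);
    }
    return done;
}

#ifdef CONNECT_LOOP_MAIN
int main(void)
{
    unsigned short port = 0;
    int listener = open_listener(&port);
    int done;

    if (listener < 0) {
        perror("open_listener");
        return 1;
    }
    /* 15000 matches the number of connections in the report above. */
    done = count_connects(listener, port, 15000);
    printf("%d of 15000 connections succeeded\n", done);
    close(listener);
    return done == 15000 ? 0 : 1;
}
#endif
```

To try it as a standalone program, compile with something like
cc -DCONNECT_LOOP_MAIN connect_loop.c -o connect_loop (the file name is
arbitrary). A count well below 15000 would suggest the machine is running
out of TCP/IP slots in the way described above. Because it only exercises
the loopback interface, it will not reproduce a problem that appears
only between two different hosts.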