We currently use DolphinScheduler 2.0.5 to build our data platform. To improve the platform’s stability and find bugs as soon as possible, we want to introduce E2E tests into our project. This post records how we enabled the E2E tests and how we resolved the Could not connect to Ryuk problem.
I. Causes
1. E2E Test
The community has a stable E2E testing module, but it does not support version 2.x and can only be used in version 3.0 and above. The community solution builds a new Docker image from the source code and uses the Selenium framework to access the Docker container for testing.
However, we have done secondary development on top of the community release and integrated scheduling for jobs such as K8s, Flink, and Spark. We also need to test that the Flink UI can be accessed as expected after a Flink job starts. Using the Selenium framework to test the online service therefore suits our situation better, so we plan to adapt the community E2E module and introduce it into our project.
2. Code Changes
The code changes are not extensive. The community’s E2E module contains the logic for testing a local service, so we only need to change that part to test the online service instead.
DolphinSchedulerExtension.java
If your service uses https, change the protocol to https.
driver.get(new URL("http", address.getHost(), address.getPort(), rootPath).toString());
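As a rough sketch of the protocol switch, the URL construction could be pulled into a small helper so that http and https are just a parameter. The class name TargetUrl, the host ds.example.com, and the path are placeholders I made up for illustration; they are not from the E2E module.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper (not part of the DolphinScheduler E2E module):
// builds the page URL that the WebDriver should open.
final class TargetUrl {

    // protocol is "http" or "https"; rootPath should start with "/".
    static String build(String protocol, String host, int port, String rootPath) {
        try {
            return new URL(protocol, host, port, rootPath).toString();
        } catch (MalformedURLException e) {
            // Surface bad input as an unchecked exception to keep callers simple.
            throw new IllegalArgumentException("invalid URL parts", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder host and path, assumed for this example only.
        System.out.println(build("https", "ds.example.com", 443, "/dolphinscheduler/ui"));
    }
}
```

The result of build(...) would then be passed to driver.get(...) in place of the hard-coded "http" variant.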
For other startup parameter changes and the detailed E2E logic, see the official documentation.
We also need to change the login logic and the other cases we want to test. The Selenium framework locates elements such as the add and delete buttons by their class names on the web page, then simulates browser operations to run the test. Version 3.x refactored the front end, so the page elements are completely different from those in 2.x, and we need to change this part of the code. This part is not difficult, so I won’t go into details.
3. Local Test
After the code changes, I tested locally by following the documentation. I needed to add -Dm1_chip=true to my run parameters because my Mac has an M1 chip; the flag is used to support ARM64 containers.
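To illustrate what such a flag typically does, here is a minimal sketch of reading a JVM system property and choosing a browser image by architecture. The class name BrowserImage and both image names are my assumptions for the example; check the E2E module’s source for what it really uses.

```java
// Hypothetical helper: picks a browser container image based on -Dm1_chip.
final class BrowserImage {

    // Image names are illustrative assumptions, not taken from this post.
    static final String AMD64_IMAGE = "selenium/standalone-chrome";
    static final String ARM64_IMAGE = "seleniarm/standalone-chromium";

    // Boolean.parseBoolean treats null (property not set) as false.
    static String forFlag(String m1ChipProperty) {
        return Boolean.parseBoolean(m1ChipProperty) ? ARM64_IMAGE : AMD64_IMAGE;
    }

    public static void main(String[] args) {
        // Run with: java -Dm1_chip=true BrowserImage.java
        System.out.println(forFlag(System.getProperty("m1_chip")));
    }
}
```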
The local test ran successfully: it logged in to our online service as expected and tested the tenant management process.
We have deployed a Jenkins service, and our E2E tests will ultimately run in Jenkins. Our Jenkins service runs in K8s mode: every new task creates a new pod, which is destroyed after the task finishes.
With the local test successful, the next step was to run the test in Jenkins. We expected the same result as locally, because our unit tests and online auto build have been running stably in Jenkins for several months, but something went wrong in the E2E module.
4. Test In Jenkins
Something went wrong in the E2E module when running in Jenkins:
2022-08-06 14:49:54,714 org.testcontainers.DockerClientFactory 190 [main] INFO [] - Connected to docker:
The error message is very clear: Could not connect to Ryuk.
II. Troubleshoot
1. Preliminary check
The first step when encountering a problem is to Google it, and I found that it was a common problem. There are a lot of solutions on the web, like:
- Some people say that restarting Docker resolves the problem, but it did not work for me
- Upgrade testcontainers-java to version 1.15.1, but I am already using 1.17.3
- Run docker system prune. I had checked that Docker has enough space, so I didn’t try this one
- Disable Ryuk by setting the environment variable TESTCONTAINERS_RYUK_DISABLED=true to avoid the problem. After some research, I learned that Ryuk cleans up containers after the tasks finish. I tried this one, and sure enough, no error is reported if Ryuk does not start. But it produces another error:
2022/08/06 08:51:49,364 docker[testcontainers/sshd:1.1.0] 440 [main] INFO [] - Container testcontainers/sshd:1.1.0 is starting: a9a5ccc8addcedb374a290675893291214d2123b6e2a5dc5e30b9e54aa01e828
The error message still says it cannot connect, this time to 172.17.0.1:32812. It means that no matter what the service is, it is actually impossible to connect. In other words, Ryuk cannot simply be turned off; the connection problem itself must be resolved.
2. Could not connect
The problem comes back to Ryuk. Checking the log again, we can see that the Ryuk service has actually been started:
2022-08-06 14:49:55,088 docker[testcontainers/ryuk:0.3.3] 440 [main] INFO [] - Container testcontainers/ryuk:0.3.3 is starting: 066d1e071dfc6bce3a71c1321a4331d1b388f50ac25d7c9de3f41f31447500e2
Log in to the Jenkins pod to check whether the Ryuk service has actually started.
root@jenkins-pipeline-ln62r:~/agent# docker ps | grep ryuk
We can see that the Ryuk service has started, so the problem is that it cannot be reached. We test that with the wget command.
root@jenkins-pipeline-ln62r:~/agent# wget 172.17.0.1:32772
We can see that it is indeed unable to connect. Because the Jenkins task pod uses the host machine’s /run/docker.sock file, we log in to the host machine to test the service.
[root@k8s-worker-01 ~]# curl 172.17.0.1:32772
The connection is successful on the host machine.
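The wget and curl checks above amount to a plain TCP connection attempt. As a minimal sketch, the same probe could be written in Java and run from both the pod and the host to compare results. The class name PortProbe and the two-second timeout are my own choices for illustration, and the default address is simply the one from the wget command.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: reports whether a TCP connection to host:port succeeds.
final class PortProbe {

    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // Fails with an IOException on refusal, timeout, or unreachable host.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults mirror the wget test; pass host and port as arguments to override.
        String host = args.length > 0 ? args[0] : "172.17.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 32772;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 2000));
    }
}
```

Running it in both places makes the asymmetry explicit: the same address and port succeed on the host and fail inside the pod.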
At that time, I thought it might be a network setting problem with testcontainers. I changed some Docker-related parameters following the documentation, but it still did not work. It looked like this had nothing to do with testcontainers, and I thought the problem might still be in Docker.
3. Firewall Troubleshooting
Rethinking the problem: every new Jenkins task creates a new Docker container, that container creates a Ryuk service, and the Ryuk service cannot be reached. At that time, I thought the problem was a refused connection in a Docker-in-Docker setup, and suspected that Docker in Docker could not access the host machine’s ports.
Since the connection could not be established, I thought it might be a firewall issue.
After some inquiries, I suspected that the host machine’s iptables rules denied connections from Docker, so I used the command below to allow all traffic from Docker to access all of the host’s ports.
iptables -A INPUT -i docker0 -j ACCEPT
But this still did not resolve the problem.
I then found some related information and suspected that the host’s Docker configuration had issues and denied the access, so I tried changing the host’s Docker configuration.
ExecStart=/usr/bin/dockerd -H tcp://0.0.0.0:4243 -H unix:///var/run/docker.sock
But this still did not solve the problem.
Still suspecting that the host machine’s firewall blocked some access, I disabled the host machine’s firewall.
Still, the connection was refused.
At this point I found the same problem on GitHub. Following the suggestions there, I used httpd to test.
docker run -d -p 9090:80 httpd
I found that the httpd service could not be accessed either.
That means our thinking was wrong: it has nothing to do with the firewall. All permissions were opened and all firewalls were turned off, but the service still could not be accessed.
I also realized that testing against Ryuk is inconvenient. I need to run mvn commands to start the E2E test, a Ryuk service is created, and it is destroyed after the mvn commands finish, so there is little time to test the connectivity; the Ryuk service usually stays up for only one or two minutes. Since the httpd service is also unreachable, I can instead start an httpd service in the Jenkins task pod, which will not be destroyed. Once I can access the httpd service from the pod, the problem is resolved.
4. Network Troubleshooting
Looks like there is a network issue.
Let’s check the Jenkins task pod network configuration.
root@jenkins-pipeline-w34kq:~/agent# ifconfig
It looks fine. The Jenkins task pod uses a bridge network, and the host has a virtual network card with the IP segment 172.17.0.0/24.
Log in to the host machine to check the network configuration.
[root@k8s-worker-01 ~]# ifconfig
Looks good too.
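Part of what "looks good" means here is that the addresses fall inside the expected segment. As a small worked example, the check below tests whether an IPv4 address belongs to a CIDR block such as 172.17.0.0/24. The class name Cidr is hypothetical, and the sketch handles IPv4 only.

```java
// Hypothetical helper: IPv4 CIDR membership check using bit masks.
final class Cidr {

    // Packs a dotted-quad IPv4 address into a 32-bit int.
    static int toInt(String ipv4) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + ipv4);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // True if ipv4 lies inside the block written as "a.b.c.d/prefix".
    static boolean contains(String cidr, String ipv4) {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);
        // Java shifts ints modulo 32, so a /0 prefix needs its own case.
        int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
        return (toInt(parts[0]) & mask) == (toInt(ipv4) & mask);
    }

    public static void main(String[] args) {
        // 172.17.0.1 is the address the E2E test failed to reach.
        System.out.println(contains("172.17.0.0/24", "172.17.0.1"));
    }
}
```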
We are stuck now: I don’t know which configuration is wrong and preventing the connection.
Let me think about it differently. I can access the Ryuk service on the host machine, but the Jenkins task pod, which uses the host machine’s docker.sock file, cannot connect. So could we run a full Docker service inside the Jenkins task pod? With its own docker.sock, I thought it would be able to connect to the Ryuk service.
After researching the documentation, I found this more difficult to implement, and I don’t think it is the best solution, since it does not address the root cause. It is equivalent to bypassing the problem rather than solving the core issue.
5. Host Mode Start
In my follow-up research, I came across this information by accident: if I start Nginx in host mode from a pod, can I access the Nginx service from this pod?
root@jenkins-pipeline-q68td:~/agent# docker run -d --name nginx --network host nginx
To my surprise, I could access it. It means that if we start a Docker container in host mode, we can access it from the pod.
Does it mean that if I start testcontainers in host mode, I can also access the service, and the problem will be resolved?
Now the direction is clear: start testcontainers in host mode.
I tried changing the Java code to specify the host network in the start parameters, but it did not work. Another option is to change Docker’s start parameters to use host mode, but I could not find a way to do it. After more research, I still could not find another way to configure testcontainers to start in host mode.
Stuck again.
6. Jenkins With Host Mode
Then I changed my thinking again: if I switch the Jenkins agent pod to host mode, can it access other containers launched with bridge networks?
Jenkins already includes this feature: just enable Host Network in the Pod Templates section of the Configure Clouds page, and change the Jenkins URL to the host-mode URL.
I checked with httpd again, and there were no problems.
root@pdak8s-worker-01:~/agent# docker run -d -p 9090:80 httpd
We rerun the E2E test to see whether it can access the Ryuk service.
2022-08-06 10:59:48,874 docker[testcontainers/ryuk:0.3.3] 440 [main] INFO [] - Container testcontainers/ryuk:0.3.3 is starting: 0859ffdf6c6b6c06e0ea39acf6155aaa98bb837f2926640af8acd8d268d0d34b
No problem at all, which means our problem is finally solved.
III. Summary
Looking back, our original thinking was flawed: this is actually not Docker in Docker.
The Jenkins task pod uses the same /var/run/docker.sock file as the host machine. It means the Jenkins task pod uses the same Docker service as the host machine, so the Ryuk service started by the Jenkins task is actually started on the host machine’s Docker. In other words, the Jenkins task pod and Ryuk are two Docker services at the same level on the host.
So I took a lot of detours, but fortunately the problem was finally solved. I have left a message on the same issue on GitHub because, as the saying goes, don’t repeat yourself.
Solving this problem has, in fact, raised a new one. Our Java service is also deployed in Docker and connects to other services such as Nginx and Redis. These are also Docker services, and they can connect with each other normally. So why can’t the Jenkins task pod connect to Ryuk, when both containers are started in bridge mode?
To be continued.