General Concerns #140

benwilson512 · 2015-12-23T07:42:09Z

Over the course of working through the PR to fix #128 and #129 I've run into a variety of different problems that I'd like to discuss. These problems are unfortunate, because Exq serves a very important role in the community. Many people are coming from a Ruby / Rails background and Exq helps make integrating Elixir with existing Ruby systems much easier. Nonetheless, the problems I present here individually and in aggregate are serious enough to warrant great caution on the part of any who might consider using this library.

I bring these problems up not only because I want Exq to succeed, but because I only have the available time to fix a handful of the most serious ones. A lot of valuable work has gone into this library, particularly around managing the redis association in such a way as to be compatible with Sidekiq. I'm hopeful that this post will bring about awareness of the situation so that others can address what I cannot.

NOTE: Two of these I've already created separate issues for, but I include them here for the sake of completeness.

* indicates that it will be at least partially addressed in benwilson512#1

Severe Issues

*start_link calls inside GenServer init callback
- https://github.com/akira/exq/blob/master/lib/exq/manager/server.ex#L41
- Breaks supervision guarantees about children termination
- Breaks hot code loading
Job Workers use GenServer.start calls
- https://github.com/akira/exq/blob/master/lib/exq/manager/server.ex#L203
- Totally unsupervised
GenServers include catch all handle_call and handle_cast clauses
- Only handle_info should have a catch all clause
*Only a single redis connection is shared throughout the entire application, sorta.
- The redis connection is always unsupervised
- Certain GenServers are configured to start_link a redis connection in their init function if they aren't passed one by their own start_link.
Jobs restarting is handled not by the scheduler, nor by the manager, nor even the enqueuer, but rather the stats server
- https://github.com/akira/exq/blob/master/lib/exq/stats/server.ex#L66

The ambiguities around supervision in general and job supervision in particular mean that I have no idea how this library behaves when I call ``./bin/myapp stop` via exrm for example. While it shuts down is it possible for a job to still get started? What happens to running jobs? Clarity here is paramount as I need to be sure that deploying new code to production will not orphan jobs.

General Issues

These are a hodgepodge of anti-patterns, code smells, and barriers to contribution.

*No real top level Supervisor
*Every GenServer has its own one_for_one supervisor for no discernable reason.
*Hardcoded configuration defaults (not in themselves bad) exist in many disparate locations, often redundantly.
- https://github.com/akira/exq/blob/master/lib/exq/enqueuer/server.ex#L32
- https://github.com/akira/exq/blob/master/lib/exq/manager/server.ex#L36
- https://github.com/akira/exq/blob/master/lib/exq/redis/connection.ex#L14
- https://github.com/akira/exq/blob/master/lib/exq/scheduler/scheduler.ex#L30
- These should be consolidated in the already existing Config module.
Max restarts in supervisors are higher than normal, and in some cases VERY high.
- https://github.com/akira/exq/blob/master/lib/exq/manager/supervisor.ex#L16
Configuration not actually passed through to all children
- The {:ok, sup} = Exq.Enqueuer.start_link([port: 6379]) example from the readme would seem to indicate that individual components of the system can be started with custom configuration. However, the various hardcoded Code.get calls mean that some parts will still fallback to Exq defaults whereas others will actually use the provided configuration.
Queue state is largely duplicated between the manager's process state and an :ets table that is only accessed by that same process.
- https://github.com/akira/exq/blob/master/lib/exq/manager/server.ex#L206
- The :ets stuff should likely just be removed.
handle_call({:retries}, _from, state) why is :retries wrapped in a tuple? This pattern is ubiquitous.
GenServer.start_link(__MODULE__, [opts], []) and then init([opts])
- Also ubiquitous. The needless wrapping and unwrapping of values adds noise and confusion.
handle_info catch all clauses with Logger.error("INVALID MESSAGE #{info}")
- Unlike handle_call and handle_cast, handle_info is not constrained to only receive pre-defined messages. It is not necessary to go out of the way to Logger.error them.
handle_info / terminate / stop clauses included in modules that don't do anything. use GenServer provides these basic callbacks already
Every test has Code.require_file "test_helper.exs", __DIR__ at the top. mix test runs test_helper by default
api.ex and enqueuer.ex contain a bunch of identical functions that do a GenServer.call or cast to an unspecified pid.
Tests are very brittle
- The seed is hard coded
- Global state via mix config and process name registration abounds
- :timer.sleep calls abound.
Documentation: none
TypeSpec: none

Going Forward

These are the problems I've run into while trying to fix the supervision architecture. I have not yet had time to review in detail job lifecycle management, anything related to redis usage, nor the messaging patterns of the application.

I hope to have time during the next week or so to finish the changes I've started to improve the supervision tree. Notably, this may not include supervising the workers, as that should probably be addressed as part of a general look at job life cycle management. We will see.

Afterward, I believe the most useful next step would be documentation. What does each GenServer do? How does it interact with the other GenServers? What is the lifecycle of a job? Some clarity to these questions would go a long way towards facilitating further contributions that can improve upon the current state of affairs.

Once again I really do appreciate the effort that's been put into Exq, it is a valuable part of the Elixir community.

Thanks!

The text was updated successfully, but these errors were encountered:

akira · 2015-12-23T19:06:47Z

Hi Ben,

Thanks for the detailed writeup.

I agree that there are definitely issues to be resolved, if we can discuss and prioritize those then we can create issues and start fixing them. Recently I have been focusing on getting the Redis job queueing and execution working with retries and reliable queueing - so that job are retried when failed, and in progress jobs are not lost if a worker dies (or orphan jobs as you mentioned). I wanted to get these in before doing any major architectural changes, and I feel like the system works well now in practice. But as you mentioned, it can definitely can be improved to make it easier to maintain as well as more idiomatic in terms of architecture, syntax, and supervision (I started committing to this in 2013 when there weren't as many great Elixir resources as today)

I think the timing is great now to take a step back and look at some of the patterns used and take a stab at improving the code base. It also looks like you're handling a good amount of these which is great.

I definitely agree with most of the things you mentioned. These are issues to be addressed (besides the ones included in your refactor):

documentation would be there at least at a high level for each GenServer and detailing the job lifecycle.
Shutdown process - as you mentioned, I think this can be improved to make sure supervision is setup so that shutdown works correctly
Improving worker supervision so that workers shut down along with the entire tree. Right now, worker processes are ephemeral and spawned for each job, but this is an issue with shutdown, and I think could be improved overall. Note that if worst case a worker crashes, the job is saved off to a "backup_queue" atomically as it is popped off the job queue, and gets re-enqueued in system start, so the job wouldn't get lost.
Queue state - this is an issue only with dynamic subscriptions that get added after the fact - seems like queue state should be in a separate process or shared :ets table started started at the supervisor level so it survives restarts ( issue with dynamically adding / removing queues Apply error kernel pattern to manager state #117). Let me know if you had thoughts on one or the other.
Tests - a lot of the tests are integration "black box" tests, so some of them are timing dependent. They work in practice, but it would be great if we can remove the timeouts where possible (for example use receive with expected messages from workers where possible). I think it would also be great to have more unit level tests as well. While the integration tests can be brittle, I have found them valuable in detecting issues and they usually pass consistently. The failure condition / flakey connection tests are definitely brittle, and I don't think are as useful in a normal flow. I left them in there more as a guideline of what needs to be handled.
Lower max supervisor restarts as suggested
Syntax issues, and things like tuples in messages, unneeded implementation handle_info / terminate / stop should be able to be fixed pretty easily
"api.ex and enqueuer.ex contain a bunch of identical functions that do a GenServer.call or cast to an unspecified pid." - the enqueuer is a GenServer used for getting jobs into Redis, and can be fired up just for doing that in the case where you just need to pass messages to another system. Right now it's also doing double duty for accessing some of the stats in the UI, and that should definitely be split out.

Some comments:

The "record_process" in Stats is basically just for stats that show up in the dashboard. The job retries are done at the manager level: https://github.com/akira/exq/blob/master/lib/exq/manager/server.ex#L120
I'm pretty sure configuration settings for things like Redis connection are passed down correctly and use Keyword with a Config as default. I did notice a few configuration settings like "timeout" that are not passed down correctly.
The :ets table was in use until one of the recent commits. I would be in favor of leaving it until most of the refactoring is done, in case it is needed.

After discussion, we can split these out into issues and can be there to be addressed.

Thanks!

akira · 2015-12-25T00:10:39Z

@benwilson512 Good suggestion about the documentation, I've started working on a branch, see #141.

I was also mistaken regarding the stats and retrying jobs, good catch, that definitely needs to be fixed. I can do that.

I was also wondering if you had thoughts about using a GenEvent server to broadcast events about stats / jobs vs directly sending to the different components? Seems like one of the complication is that Exq allows custom named processes, so not sure if it will simplify that a bit (I noticed a lot of the libraries don't do this and just do module name).

akira · 2016-01-05T05:22:56Z

Addressed by #146 and #147:

Job Workers use GenServer.start calls
Jobs restarting by worker instead of stats server

Addressed by #148 and #152:

Max restart lowered for Manager
Removed single argument in GenServer wrapped in tuple
Removed wrapping and unwrapping of values in GenServer init
Simplified where possible:
- :timer.sleep calls in tests
- Global state via mix config and process name registration in tests
duplication of api.ex and enqueuer.ex functions
removed handle_info use of Logger.error
removed catch all handle_call and handle_cast
removed callbacks provided by GenServer

Documentation provided by #141

benwilson512 · 2016-01-05T20:21:48Z

Hey! I apologize for taking so long to respond, the holidays required a lot of travel on my part and I wasn't able to make time.

Your comments have been really encouraging, and I'm confident that the project is in good hands. I have not had much time to make progress on my branches, but I hope to this week. I'm really excited to see the progress made so far!

akira · 2016-01-08T00:10:12Z

@benwilson512 no worries, hope you had a good holiday, and looking forward to the changes!

akira · 2016-01-29T02:22:29Z

I believe most of these have been addressed except these which are listed issues:

Typespecs / Dializer support #105 - Typespecs
Investigate / specify config values supported via opts vs config.ex #170 - Configs getting passed through opts vs taken from config.exs

Will close this issue unless any objections

benwilson512 added the Needs Discussion label Dec 23, 2015

This was referenced Dec 31, 2015

Worker supervision #147

Merged

Supervisor.start_link inside GenServer #129

Closed

Syntax cleanup #148

Merged

zhongwencool mentioned this issue Jan 15, 2016

refactor supervisor structure #158

Merged

This was referenced Jan 26, 2016

Centralize config.exs defaults #163

Closed

Investigate / specify config values supported via opts vs config.ex #170

Closed

akira closed this as completed Jan 30, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

General Concerns #140

General Concerns #140

benwilson512 commented Dec 23, 2015

akira commented Dec 23, 2015

akira commented Dec 25, 2015

akira commented Jan 5, 2016

benwilson512 commented Jan 5, 2016

akira commented Jan 8, 2016

akira commented Jan 29, 2016

General Concerns #140

General Concerns #140

Comments

benwilson512 commented Dec 23, 2015

Severe Issues

General Issues

Going Forward

akira commented Dec 23, 2015

akira commented Dec 25, 2015

akira commented Jan 5, 2016

benwilson512 commented Jan 5, 2016

akira commented Jan 8, 2016

akira commented Jan 29, 2016